Do modern information systems, which provide management with computer-generated indexes of performance, help managers to make sound judgments about an organization’s effectiveness? An extensive study made by the authors in a large organization raises serious doubts. The evidence from the study indicates that informed managers still rely much more on qualitative than quantitative criteria in appraising performance, even when the quantitative measures are available and in use. Furthermore, those managers who use more subjective data tend to agree more with one another than those who depend on highly quantified information. The findings have significant implications for management training and for the planning of information systems.

Top managers are currently inundated with reams of information concerning the performance of organizational units under their supervision. Behind this information explosion lies a seemingly logical assumption made by information specialists and frequently accepted by line managers: if top management can be supplied with more “objective” and “accurate” quantified information, they will make “better” judgments about the performance of their operating units.

But how valid is this assumption? A research study we have recently completed indicates that quantified performance information may have a more limited role than is currently assumed or envisioned; in fact, managers rely more on subjective information than they do on so-called “objective” statistics in assessing the overall performance of lower-level units.

The Human Factor

Despite the increasing desire for and the trend toward the quantification of performance results, most managers are the first to admit that their performance assessments begin and end with human judgment. Managers determine what kinds of performance information to collect; and later, when this information is in front of them, they study the data, make up their minds, and decide what to do. For example, here is how a vice president of finance in a large steel company describes the importance of human judgment in assessing organization performance: “We have capital budgets, we have fixed budgets, we have variable budgets, we have standard costs. We have the full assortment. None of our controls is 100% effective. None replaces subjective judgment.”

There are several reasons why managerial judgment is critical in evaluating the performance of lower-level units in today’s organizations:
• Organizations are becoming so large and complex that, as yet, no universal mathematical formulas exist for encompassing all the variables that reflect the total performance of a particular subunit.
• Events within and around an organization change so rapidly that indicators of “good” performance last year may not be appropriate this year.
• Managers are frequently aware of nonquantifiable factors that information specialists don’t usually hear about or record on their standard forms.
• Ultimately, it is managers, not computers, that must make decisions based on their assessments of performance. If a manager makes a “biased” judgment of performance, then his actions are also likely to be “biased.”

In this article, we shall describe the purpose and methods of our study. Then we will present the major findings, which at times are coupled with some broad implications for management. Finally, we shall conclude with more specific suggestions for improving the quality of performance judgments in organizations.

In particular, we shall consider these important questions that bear on the judgmental process:
• How important, relative to each other, are quantitative and qualitative criteria to managers in making judgments of performance? (If quantitative criteria are less important than assumed, then the organization may be able to redirect the activities of computers and staff analysts to use them more efficiently.)
• Are managers consistent from one day to the next in their judgments?
• Do managers agree more on current effectiveness than they do on changes in effectiveness over time?
• Can managers actually agree with each other in assessing the performance of organizational units beneath them? (If they can’t, then they are likely to give off conflicting signals to lower levels.)
• Must managers agree on specific criteria if they are to agree in their overall judgments about performance?
• Does a manager’s position in the organization affect his judgments of performance? (If it does, misunderstandings among managerial levels are likely to ensue.)

How the Study Was Made

We conducted our investigation in a large government agency, the Internal Revenue Service. Although the IRS is not a business organization, it does contain many elements common to most large, complex organizations. More than 74,000 people are employed by the IRS in either the national office headquarters or the 7 regional and 58 district offices, or the 7 computer service centers. The IRS organization is also divided functionally into several major divisions, with each division having a representative group at all regional and most district offices. The measurement of performance is a key concern of IRS managers; many statistical indicators of performance are regularly collected, including indexes of costs, output, and morale.

Top management in the IRS became interested in having a study made partly out of curiosity about the reliability of their performance assessments of district operations. Despite their access to many quantitative performance measures, they readily acknowledged that their overall assessments of district performance depended, in the final analysis, on subjective judgment. At the national office level, managers were interested in knowing if they agreed with their counterparts at the regional level, and vice versa. In addition, managers at all levels were curious about the degree to which they relied on quantitative versus qualitative information in forming their judgments.

The study focused on three types of performance evaluation mentioned most frequently in initial interviews with IRS managers at the national and regional levels:

1. The current effectiveness of divisional subunits within a district. (This is important because it provides early signs of problems in the making.)
2. The performance improvement of these units over the preceding 18 months. (This adds perspective to the judgments of current effectiveness, especially when a currently high-performing unit is slipping or a low-performing unit is improving.)
3. Specific reasons for a unit’s performance improvement, or lack of it. (This provides more precise clues for taking corrective action. For example, corrective action might be quite different if the reason were judged to be an uncontrollable “natural disaster” rather than a “lack of leadership.”)

To investigate how the IRS top managers evaluated these three performance dimensions, we formed 15 panels, representing 5 headquarters divisions and 10 regional divisions. Each panel consisted of 2 to 5 managers acting as judges. Insofar as possible, judges were selected for their familiarity with the performance of their particular divisions at the district level and for their “objectivity” in assessing subunit performance.

We have covered this method in some detail because, as we shall suggest later, other organizations may be interested in undertaking similar studies. (This is the first study, as far as we know, that has designed and utilized a methodology for systematically examining managerial judgments of subunit performance.)

When each judge had completed the procedure, he sent his data cards directly to us for analysis. Our analytical procedure was based largely on a mathematical technique called correlation analysis. First, we totaled the number of times that each judge chose one unit over another. From these totals we computed for each judge a rank order of those units evaluated by him; the unit chosen most over other units received a rank of “1,” and so forth down the list. Then this rank order was correlated against the rank orders for the judges in each panel. This produced a “level of agreement” within each of the 15 panels.

                                        Current          Improvement
                                        effectiveness    over 18 months

Average correlation of agreement
within 15 panels                        +.76             +.71

Average correlation of agreement
between Day 1 and Day 3 judgments
for 50 judges                           +.90             +.83
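The ranking and correlation steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not name the correlation statistic, so a Spearman-style rank correlation is assumed here, and a panel's “level of agreement” is assumed to be the average correlation over all pairs of judges. The function names and the sample choices are hypothetical.

```python
from itertools import combinations
from statistics import mean


def rank_units(choices):
    """Turn one judge's paired choices into a rank order.

    choices: list of (chosen_unit, other_unit) pairs.
    Returns {unit: rank}; rank 1 goes to the unit chosen most often.
    Units chosen equally often share the average of their positions.
    """
    wins = {}
    for chosen, other in choices:
        wins[chosen] = wins.get(chosen, 0) + 1
        wins.setdefault(other, 0)
    ordered = sorted(wins, key=lambda unit: -wins[unit])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and wins[ordered[j]] == wins[ordered[i]]:
            j += 1
        shared_rank = (i + 1 + j) / 2  # average of positions i+1 .. j
        for unit in ordered[i:j]:
            ranks[unit] = shared_rank
        i = j
    return ranks


def rank_correlation(ranks_a, ranks_b):
    """Pearson correlation computed on two rank orders (assumed Spearman style)."""
    units = sorted(ranks_a)
    xs = [ranks_a[u] for u in units]
    ys = [ranks_b[u] for u in units]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def panel_agreement(judge_ranks):
    """Average correlation over every pair of judges on a panel (an assumption)."""
    return mean(rank_correlation(a, b) for a, b in combinations(judge_ranks, 2))


# Hypothetical data: two judges comparing three units, A, B, and C.
judge_1 = rank_units([("A", "B"), ("A", "C"), ("B", "C")])
judge_2 = rank_units([("A", "B"), ("C", "A"), ("C", "B")])
agreement = panel_agreement([judge_1, judge_2])
```

Two judges who produce identical rank orders would score +1.00, and two who produce exactly reversed orders would score -1.00, matching the scale the article describes.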

Additional correlations were also computed between panels by comparing an overall rank order for each panel with the overall rank order for other panels. Perfect agreement between rank orders, as measured by a statistical correlation, was +1.00, while perfect disagreement was -1.00. There is no fixed rule for determining an “acceptable” correlation of agreement, although the following guideline tends to be commonly used in research:

+.9 or -.9 = high correlation
+.7 or -.7 = substantial correlation
+.5 or -.5 = moderate correlation
+.3 or -.3 = low correlation
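
As a rough illustration, the guideline can be written as a small lookup. The article gives only four reference points, so treating each one as the lower bound of a band is an assumption made here, as is the function name.

```python
def describe_correlation(r):
    """Label a correlation using the guideline's four reference points.

    Assumption: each reference point is read as the lower bound of its band,
    and the sign is ignored because the guideline treats + and - alike.
    """
    strength = abs(r)
    if strength >= 0.9:
        return "high"
    if strength >= 0.7:
        return "substantial"
    if strength >= 0.5:
        return "moderate"
    if strength >= 0.3:
        return "low"
    return "below the guideline's lowest band"


label_for_panel_average = describe_correlation(0.76)
```

Read this way, a value such as +.76 falls in the “substantial” band, while a value such as +.16 falls below the lowest reference point.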

A basic and critical question for any management is: Can our managers agree with one another about the performance of units under their supervision? Pessimists contend that managers cannot mentally assimilate and agree on all the complex performance data available, or that managers are such an idiosyncratic lot psychologically that it is impossible for them to agree with one another. On the other hand, optimists argue that managers are quite adept at simplifying complex information, being far more logical and objective than the pessimists might believe.

Our findings strongly support the optimists. While we cannot answer the agreement question for all organizations, the results do reveal that the particular managers we studied were, in general, able to reach a substantial level of agreement in their overall judgments of both current effectiveness and performance improvement. There was also a high correlation between their first-day and third-day judgments on both performance measurements.

The critical reader should ask, of course, if the high agreement was merely a product of common “bias” among the judges within each panel. If such a bias did exist, this would be disturbing because IRS managers, while in overall agreement, could be making inaccurate assessments. One check on the amount of bias was to examine the extent of agreement between each divisional panel at the regional level and its counterpart panel at the national office. These panels all reported to different bosses and were separated by large physical distances. Under these conditions of limited authority and interaction, we felt that high agreement between national and regional office panels could not be explained in terms of a commonly shared bias.

The findings indeed revealed considerable agreement between regional and national office panels from each of the five divisions: an average of +.75 for their current effectiveness judgments and +.65 for their performance improvement judgments. Therefore, we think it reasonable to infer that common bias was not a strong contributing factor toward high agreement.

The improvement evaluation is obviously a complex assessment, which includes many subjective considerations and also requires a longer time perspective. Yet IRS managers seemed to find themselves on relatively the same historical wavelength. This finding is important because a manager’s awareness of performance trends is often what tempers his action-taking plans. Lack of agreement about trends could produce not only inappropriate actions but also conflicting decisions from different managers.

At the same time, we should point out that a considerable range existed between panels with the lowest and highest levels of internal agreement. For current-effectiveness judgments, the lowest agreement panel had an internal correlation of +.16, while the highest agreement panel had +.99. For performance-improvement judgments, the internal correlation was +.10 for the lowest agreement panel and +1.00 for the highest panel. Thus, a large majority of panels revealed substantial internal agreement, while a few panels revealed much disagreement. This suggests the importance of discovering the factors that block agreement on some panels and the factors that cause high agreement on others.


We found two important organizational variables that seemed to distinguish between high- and low-agreement panels. The most potent variable appeared to be “organizational distance.” Exhibit IV shows not only that members of national office panels (two levels removed from districts) agreed less with one another in comparison with judges on regional office panels (one level removed), but also that their judgments were less stable from Day 1 to Day 3.

We prefer the term organizational distance to physical distance because the Region I office and its 12 districts were located within 600 miles of Washington, while the Region II office and its 9 districts were situated more than 1,300 miles away. Yet national office panels did not reach any more agreement about the closer Region I districts than they did about the more remote Region II districts. It appears that sitting close to the top of the organizational pyramid is not necessarily the easiest or best vantage point for assessing field unit performance. Undoubtedly, certain information disappears in the gap between levels, never reaching the top.

For us, these findings raise doubts about concentrating too much decision-making power at the top of large organizations when the decision to be made is based on the evaluation of performance. They also cause one to question an overreliance on centralized information systems. Centralized systems, because of their remoteness and need for uniformity, may be particularly insensitive to what is happening in each field unit.


A second important, but less pronounced, organizational variable was the size of the functional division. The results reveal that panels from the two largest agency divisions (A and B), each of which employed more than twice as many people as any other division, reached lower levels of agreement. These large-division panels, regardless of their level in the organization, seemed to have particular difficulty assessing performance improvement over time.

Managers in large divisions are often physically and organizationally separated; they also become more specialized in their job functions. As a result, their communications are likely to be less frequent and conducted from narrower frames of reference. Further evidence of this communications breakdown was found in the fact that large division panels from the national office agreed with their counterpart panels at the regional offices only at the level of +.46 when judging performance improvement, while national and regional panels from small divisions agreed with each other at a much higher level, +.83, in judging improvement. Apparently, the communications pipeline between national and regional levels was more open in small divisions.

Both of the preceding findings make clear that the performance judgments of IRS managers are affected by their positions in the organization. We suspect that the same findings apply to other large organizations as well. If a manager is located at headquarters, he is less likely to agree with his colleagues. In addition, if he is in a large division, he is less likely to agree not only with his peers at headquarters but also with managers at the next lower level in his division. Judgments of current effectiveness probably will not be as strongly affected by these organizational forces as are judgments of performance improvement.

Lower-level managers, because they can agree more with each other, may be able to teach higher-level managers a few of their trade secrets. Some clues to these trade secrets became more obvious when we focused on the specific criteria used by high-agreement panels.

Criteria for Judgment

While broader organizational forces (distance and size) produced variations in judgment, the specific criteria used by judges also contributed to differences in agreement.

Criterion                                                               Rating*

 1. Quality and effectiveness of management                                4
 2. Productivity measurements                                              1
 3. Manpower utilization                                                   4
 4. Overall improvement, status quo, or decline                            3
 5. Inventory level of uncollected TDAs (taxpayer delinquent accounts)     1
 6. Progress and achievement of established objectives and planned
    programs                                                               4
 7. Morale                                                                 4
 8. Management participation and concern in local problem solving          5
 9. Potential available and use to which potential is put                  4
10. Improved quality and composition and balance of fieldwork              5

* 1 = highly quantitative; 2 = more quantitative than qualitative; 3 = mixed quantitative and qualitative; 4 = more qualitative than quantitative; and 5 = highly qualitative.

An analysis was made of the criteria filled out by judges on the reverse sides of their “performance improvement” cards. As a first step, we arranged for an independent group of IRS analysts in Washington to categorize the criteria reported by the 50 judges. This group of analysts individually rated each of the reasons given by the judges on a 5-point scale: a rating of 4 or 5 was given to highly qualitative criteria, a 3 to mixed qualitative and quantitative criteria, and a 1 or 2 to highly quantitative criteria. One example of a qualitative criterion was “management is setting challenging goals,” while a quantitative one was “time spent per average case.”

The list above shows the ranking of the 10 most frequently mentioned criteria. The phrasing of these criteria was done by the IRS national office analysts, who inferred the categories from a variety of specific phrases found on the judges’ cards. Some categories are unique to IRS operations, but they indicate general types of criteria that could apply to other organizations as well. Important here is the fact that a large majority of items (7 out of 10) are qualitative (rated 4 or 5 on the rating scale), although two strongly quantitative criteria (rated 1) were also mentioned by the judges. The most important criterion was “quality and effectiveness of management”; it was used by judges in 13 of the 15 panels.
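
The tally behind “7 out of 10” can be reproduced with a short sketch that groups the ten listed criteria by their analyst ratings. The grouping rule follows the scale described earlier (4 or 5 qualitative, 3 mixed, 1 or 2 quantitative). One rating is an inference rather than a quoted value: the rating for the TDA inventory criterion is taken to be 1, because the text says two criteria were rated 1. The function name is hypothetical.

```python
from collections import Counter


def category(rating):
    """Map a 1-5 analyst rating to the article's three groupings."""
    if rating >= 4:
        return "qualitative"
    if rating == 3:
        return "mixed"
    return "quantitative"


# Ratings for the ten most frequently mentioned criteria.
ratings = {
    "Quality and effectiveness of management": 4,
    "Productivity measurements": 1,
    "Manpower utilization": 4,
    "Overall improvement, status quo, or decline": 3,
    "Inventory level of uncollected TDAs": 1,  # inferred, not quoted
    "Progress and achievement of objectives and programs": 4,
    "Morale": 4,
    "Management participation in local problem solving": 5,
    "Potential available and use of potential": 4,
    "Improved quality and balance of fieldwork": 5,
}

counts = Counter(category(r) for r in ratings.values())
shares = {name: count / len(ratings) for name, count in counts.items()}
```

The same grouping, applied to all 92 criteria reported by the judges, would yield the percentage split between qualitative, mixed, and quantitative criteria.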

From this initial categorizing process, we found that 92 different criteria were used by the entire group of judges. These criteria divided themselves into 39% qualitative, 22% mixed, and 39% quantitative, based on the ratings assigned by the IRS analysts. Of significant interest here is the fact that such a high percentage of qualitative criteria were used. The IRS devotes considerable manpower and money to quantifying performance results, yet these numerical results played a more limited role than we suspected.


Next, we compared the criteria used by four high-agreement panels (those panels with an internal correlation of +.84 or better) with the criteria used by four low-agreement panels (those with an internal correlation of +.30 or lower). Here we did not find any significant difference in the total number of criteria used; that is, low-agreement panels did not appear to be confusing themselves with too many criteria. High-agreement panels averaged 11.8 different criteria per panel, and low-agreement panels, 12.8. One high-agreement panel used as few as 7 criteria, while another used 20. Approximately the same range of total criteria (6 to 20) was found among low-agreement panels.


An important distinction was discovered in the extent of common criteria used by high-agreement panels. The study reveals that 44% of the criteria used in each high-agreement panel were commonly used by every judge within that panel. Only 12% of the criteria used in low-agreement panels were common to every judge.

We conclude from Exhibit VII that lack of agreement about specific criteria probably results in a lack of agreement about the overall performance of a unit. At the same time, we should point out that “perfect” agreement on specific criteria is not essential; a high percentage of criteria (41%) were unique to individual judges in high-agreement panels. Apparently, many judges took somewhat different reasoning paths to arrive at essentially the same end result. A broad implication here is that, while managements should work toward agreement on criteria for evaluating overall performance, they should also leave some latitude for each manager to select his own reasons. All too many managements spend endless meeting hours trying to agree on a limited number of criteria. According to our findings, this costly and often frustrating task may not be necessary.
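
The two measures discussed here, the share of criteria common to every judge and the share unique to a single judge, can be computed as follows. This is a minimal sketch: the function name and the sample panel are hypothetical, and the article does not spell out how the percentages were calculated, so dividing by the number of distinct criteria used within the panel is an assumption.

```python
def criteria_overlap(panel):
    """Return (common_share, unique_share) for one panel.

    panel: list of sets, one set of criteria per judge.
    common_share: fraction of the panel's distinct criteria used by every judge.
    unique_share: fraction used by exactly one judge.
    """
    all_criteria = set().union(*panel)
    common = set.intersection(*panel)
    unique = {
        criterion
        for criterion in all_criteria
        if sum(criterion in judge for judge in panel) == 1
    }
    total = len(all_criteria)
    return len(common) / total, len(unique) / total


# Hypothetical three-judge panel, for illustration only.
example_panel = [
    {"morale", "productivity", "leadership"},
    {"morale", "productivity", "backlog"},
    {"morale", "productivity", "manpower use"},
]
common_share, unique_share = criteria_overlap(example_panel)
```

In this made-up panel, two of the five distinct criteria are shared by every judge and three are unique to one judge, so both kinds of criteria coexist, much as the article reports for high-agreement panels.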


One very significant finding was that high-agreement panels used considerably more qualitative than quantitative criteria in making their decisions. The study shows that 69% of the criteria used by high-agreement panels were qualitative, compared to only 20% for low-agreement panels. Low-agreement panels used 68% quantitative criteria yet could reach an overall agreement level of no better than +.31. Furthermore, we found that the only criteria that low-agreement panels could completely agree on were quantitative criteria; there was far less agreement on qualitative criteria. This suggests that if managers want to be more in line with their colleagues in assessing total performance, they need not only to use a greater proportion of qualitative criteria but also to develop more consensus on the specific qualitative criteria to be used.

Why would qualitative evidence be relied on so heavily by high-agreement panels, and why might these criteria lead them to a greater overall agreement? As we interpret our findings, there are at least three reasons:

1. Qualitative factors probably give more concrete, more sensitive, and more current clues to the underlying strengths of a unit; whereas statistics, despite their apparent preciseness, are usually averages or aggregates that fail to portray the complex events behind them.
2. Qualitative criteria present clearer leads for required corrective action; whereas statistical results may give little indication of why events happened as they did.
3. Qualitative criteria tend to be broader because they are not tied to particular measurable points; whereas quantitative criteria, just because they have to be particularized to certain narrow segments of field operations in order to be measured, may result in very diverse inferences being drawn from them in judging overall performance.

Applying the Results

Up to this point, we have mentioned some general implications of our findings; now we would like to draw them together and offer some specific suggestions for action:

Most important is the need to recognize that managers, not computers, numbers, or information systems, are the critical element in the assessment of subunit performance.

Statistical reports have increasingly taken on the revered status of “objectivity,” while managerial judgments have too often been sidetracked as overly “subjective” or “opinionated.” Thus we find organizations building larger headquarters staffs to process ever larger amounts of statistical performance information, much of which is never used.

All organizations ought to be vitally concerned with studying how their managers actually assess subunit performance. While organizations frequently spend large sums of money generating more and more information about subunit performance, they seldom consider what information is actually used or needed by their managers. Nor do they always recognize the importance of achieving a high level of agreement among top managers about subunit performance. If managers cannot agree, there is something amiss, either with the company’s information system or with the managers and their organizational environment.

We, therefore, suggest that organizations take periodic “X-rays” of their judgmental process. The study methodology used in the IRS is one useful approach. The findings can be used both for training managers to reach more informed judgments about subunit performance, and for designing information systems that will provide more help to managers in making their judgments.

Management training should use research findings derived from the actual company environment to design programs that fit the needs of its particular management group. To do this, the key management group participating in the study could be brought together to hear and discuss the significant findings.

The critical questions before these managers should be: (a) Why do we have these results? (b) What do we do about them? The answers, of course, will vary with each organization and its unique findings. But the general thrust of actions afterward should be more informed and constructive.

For example, a top management group may decide to pay greater attention to the judgments of its field managers. Or agreement may be reached to place greater and more explicit emphasis on qualitative criteria. Or the present information system may be altered to provide data on those criteria that are most frequently used while eliminating data on those that are not heavily used.

The next step would be to discuss the findings within smaller functional groupings. It would be useful for those groups in the lowest agreement to sit down and discuss why they see subunits from such different perspectives. They may discover, in the process of this examination, that they fail to discuss their observations sufficiently with each other. Or they may find that each is using too diverse a set of criteria and that more consensus needs to be reached on particular criteria.

The formal information system must be designed to complement these changes if they are to be put into practice. For instance, it makes little sense for a company computer to continue providing data on 50 variables when its managers are in substantial agreement after using data on only 15 variables.

A real challenge for some organizations is to build more qualitative information into their formal systems. One method used in some companies is to request a written narrative with each submission of statistics from the field. Another method is to hold periodic, in-depth discussions involving several managers from different levels so that each can contribute whatever qualitative data are available to him.

Organizations might also consider the possibility of incorporating a judgmental procedure, such as the one used in this study, into an organization’s ongoing process of performance assessment. Managers need to consider the challenge of systematically recording managerial judgments as much as they systematize statistical results. Lower-level managers can attest to their feelings of frustration when one upper-level manager tells one of them that he is doing a “good” job while another upper-level manager downgrades him. On the other hand, if this same manager knows that five upper-level managers systematically agree in ranking his unit at the “tail end” of the pack, he cannot as easily rationalize his position.

Our suggestion at this early stage, however, is to experiment with, but not institutionalize, a more systematic judging procedure, perhaps in only one division of a large company. Every formal system, if taken too seriously and rigidly, can become more of a hindrance than a help. There are always bugs to be worked out of these systems before giving them a wider application.


A major task of any management is to know what is actually taking place within its organization. One critical, but seldom examined, function is the manner in which key managers assess the performance of units under their supervision. In the absence of knowledge, numerous myths and assumptions have abounded. Particularly noticeable is a growing mistrust of the reliability of managerial judgments. Signs of this mistrust are reflected in current trends toward more statistics, more computers, more information specialists, and more centralized information systems. The IRS, where this study was conducted, is no exception.

Yet the findings of this study seriously dispute many of these newer trends and assumptions. Notably:

1. Managers can generally agree with each other about the current effectiveness and performance trends of subunits under them.
2. Their judgments seem to be quite stable from one day to the next.
3. Managers who agree most with their colleagues tend to come from levels closest to the field, work in smaller divisions, use more commonly shared criteria, and rely more on qualitative than quantitative criteria.

These conclusions must be qualified to the extent that they are based on one study in a single nonbusiness organization. Therefore, we should treat them more as propositions to be tested further than as final answers.

However, we believe these findings place a new challenge before every management: to seek new ways of studying, assisting, and restoring confidence in the performance judgments of their managers. This will not occur magically. First, a concerned management will have to investigate and identify its current practices for judging performance. Then, it will have to use the study findings to train its managers in improving their judgmental practices. Finally, it will have to strive to make its information system a more helpful servant rather than an irrelevant master.

Copyright © 2020. All Rights Reserved.