
Danner v. City of S.F.

COURT OF APPEAL OF THE STATE OF CALIFORNIA FIRST APPELLATE DISTRICT DIVISION FIVE
Oct 3, 2017
No. A144499 (Cal. Ct. App. Oct. 3, 2017)

Opinion


JOHN H. DANNER III, et al., Plaintiffs and Appellants, v. THE CITY AND COUNTY OF SAN FRANCISCO, et al., Defendants and Respondents.


ORDER MODIFYING OPINION AND DENYING REHEARING [NO CHANGE IN JUDGMENT]

THE COURT:

It is ordered that the opinion filed October 3, 2017 be modified as follows:

(1) Insert the following footnote on page 17 at the conclusion of the first full paragraph: "In a petition for rehearing, Plaintiffs contend the City's failure to perform any 'pre-testing' to establish the reliability of the Exam renders the Exam invalid under Guardians. Plaintiffs did not raise this argument in their briefs on appeal. In any event, contrary to Plaintiffs' contention, Guardians did not hold that pre-testing is required to justify rank-ordered scoring. Instead, Guardians identified pre-testing as one means of demonstrating the reliability of test questions. (Guardians, 630 F.2d at p. 102.) The City's validation report states an alternative measure of reliability was used and Plaintiffs do not cite any testimony by Zedeck that this alternative measure was insufficient."

(2) Renumber the following footnotes accordingly.

(3) Replace the third sentence of the second full paragraph on page 8 with the following: "With one exception (addressed post, Part II.B.2), Plaintiffs also do not dispute the Exam tested the candidates on these knowledge, skills, and abilities."

There is no change in the judgment.

Plaintiffs' petition for rehearing is denied.

Dated:__________

/s/_________, P.J.

NOT TO BE PUBLISHED IN OFFICIAL REPORTS

California Rules of Court, rule 8.1115(a), prohibits courts and parties from citing or relying on opinions not certified for publication or ordered published, except as specified by rule 8.1115(b). This opinion has not been certified for publication or ordered published for purposes of rule 8.1115.

(San Francisco County Super. Ct. No. CGC-10-501981)

Plaintiffs—sixteen firefighters employed by the City and County of San Francisco (the City)—were not promoted following a promotion examination and sued the City for age discrimination. Following trial, a jury returned a verdict for Plaintiffs, but the trial court granted the City judgment notwithstanding the verdict. We affirm that order.

Plaintiffs also sued certain City agencies and a City employee. For convenience, we will refer to all defendants as "the City."

FACTUAL BACKGROUND

In 2007, the City's Fire Services Exam Unit (Exam Unit) began the process of developing a promotional examination (the Exam) for the position of H-20 Lieutenant (lieutenant) in the City's Fire Department. The Exam Unit began by conducting an extensive job analysis to determine the essential elements of the position. The Exam Unit asked lieutenants and captains (who supervise lieutenants) in the Fire Department to develop a list of the required knowledge, skills, and abilities. The Exam Unit organized the items on this list into related groups, or "clusters," resulting in seven knowledge clusters, 10 ability clusters, and 16 task clusters. The lieutenants and captains then rated the knowledge and ability clusters to identify their relative importance; rated the task clusters to identify their relative importance and frequency; and identified which knowledge and ability clusters were essential or helpful to performing each task cluster.

For example, the clusters included "knowledge of firefighting and suppression equipment," "knowledge of fire science," and "management."

Using the job analysis, the Exam Unit worked with a Fire Department Battalion Chief to develop the Exam. The first part of the Exam was a fire scene and first aid simulation exercise (the Fire Scene Component): candidates were shown photographs and street diagrams representing four different fire scenarios and, for each scenario, had to write down the initial communications they would make, the actions they would take, the equipment they would need, and other pertinent information. For the fourth scenario, candidates also had to complete a post-fire analysis and injury report documenting their actions in that scenario. The second part of the Exam was a training and counseling exercise (the Training Component), in which candidates were videotaped training and counseling or disciplining an actor playing the role of a probationary firefighter.

"Scoring keys" for the Fire Scene Component—lists of correct answers to each exam question—were developed by a five-member committee composed of Fire Department members with the rank of Battalion Chief or above (the Scoring Key Committee). Another such committee created scoring keys for the Training Component. The individual exams were rated by dozens of fire officers from other cities who held the rank of Captain or higher. The raters underwent a day-long training and rated the exams using the scoring keys.

Approximately 750 candidates completed the Fire Scene Component in May 2008. Candidates who received a below average score on the Fire Scene Component were not invited to complete the Training Component. In August 2008, 409 candidates completed the Training Component. The City ranked these 409 candidates based on their combined scores for both components; the final scores ranged from 707 to 1000. With a few exceptions, promotions to lieutenant were made in rank order based on candidates' Exam scores. Plaintiffs either received below average scores on the Fire Scene Component or, based on their combined scores, ranked too low to receive promotions.

Following the Exam, the Exam Unit prepared a report (the Validation Report) describing the examination process and concluding the Exam was valid under professional standards of reliability. Outside consultant Harry Brull—who had provided consultation during the entire examination process—also testified the Exam was valid and job-related.

PROCEDURAL BACKGROUND

In 2010, Plaintiffs filed the instant lawsuit. A jury trial was held on Plaintiffs' claim that the Exam discriminated on the basis of age in violation of Government Code section 12940, part of the California Fair Employment and Housing Act (Govt. Code, § 12900 et seq.; FEHA). Plaintiffs' evidence included the testimony of an expert in examination validity and job relatedness, Sheldon Zedeck.

Zedeck was unavailable to testify at trial but his deposition testimony was read to the jury.

By special verdict, the jury found that the Exam had a disparate impact on persons over the age of 40; that the purpose of the Exam was to operate the business safely and efficiently and in a job-related manner; but that the Exam did not substantially accomplish this business purpose. The City moved for judgment notwithstanding the verdict. The trial court granted the City's motion, concluding "no substantial evidence exists to support the jury's finding that the 2008 Exam was not job-related."

Because, as explained below, we are affirming on this ground, we need not decide whether the trial court's ruling could be affirmed on the alternate ground that no substantial evidence supports the jury's finding that the Exam had a disparate impact on persons over 40. We also need not decide whether, as Plaintiffs contend, the trial court erred in denying Plaintiffs' post-verdict motion for equitable relief and in granting the City's alternative motion for a new trial.

DISCUSSION

I. Legal Standards

A. Discriminatory Employment Tests

California law prohibits employment discrimination on the basis of age. (Govt. Code, § 12940, subd. (a).) "A testing device or other means of selection that is facially neutral, but that has an adverse impact (as defined in the Uniform Guidelines on Employee Selection Procedures (29 C.F.R. 1607 (1978)) upon persons on a basis enumerated in [FEHA], is permissible only upon a showing that the selection practice is job-related and consistent with business necessity . . . ." (Cal. Code Regs., tit. 2, § 11017, subd. (e).) In other words, " '[d]iscriminatory tests are impermissible unless shown, by professionally acceptable methods, to be predictive of or significantly correlated with important elements of work behavior which comprise or are relevant to the job or jobs for which candidates are being evaluated.' " (Assn. of Mexican-American Educators v. California (9th Cir. 2000) 231 F.3d 572, 584 (en banc) (AMAE).)

" ' "Because the antidiscrimination objectives and relevant wording of title VII of the Civil Rights Act of 1964 (Title VII) [(42 U.S.C. § 2000e et seq.)] [and other federal antidiscrimination statutes] are similar to those of the FEHA, California courts often look to federal decisions interpreting these statutes for assistance in interpreting the FEHA." ' " (Richards v. CH2M Hill, Inc. (2001) 26 Cal.4th 798, 812.) The parties agree federal cases are relevant to our analysis of FEHA's requirements.

When a plaintiff has established a prima facie case of discrimination by proving that an employment test disparately impacts a protected group, "the burden shifts to [the] [d]efendants to demonstrate that the [employment test] was validated properly." (AMAE, supra, 231 F.3d at p. 584, fn. omitted.) To meet this burden, the defendants "are required to 'show that [the test] has "a manifest relationship to the employment in question." ' [Citation.] In cases in which a scored test, like this one, is challenged, we require that the test be 'job related'—that is, 'that it actually measures skills, knowledge, or ability required for successful performance of the job.' [Citation.] In making a determination about job-relatedness, we follow a three-step approach: 'The employer must first specify the particular trait or characteristic which the selection device is being used to identify or measure. The employer must then determine that the particular trait or characteristic is an important element of work behavior. Finally, the employer must demonstrate by "professionally acceptable methods" that the selection device is "predictive of or significantly correlated" with the element of work behavior identified in the second step.' " (Id. at p. 585.)

If the employer demonstrates job-relatedness, "the burden shifts back to the plaintiff to show the existence of other selection devices that also would 'serve the employer's legitimate interest in efficient and trustworthy workmanship,' but that are not discriminatory." (AMAE, supra, 231 F.3d at p. 584, fn. 7.) Plaintiffs do not contend they made such a showing here.

However, while the employer "must demonstrate a significant relation between the challenged selection device or criteria and the important elements of the job or training program, . . . the employer need not establish a perfect positive correlation between the selection criteria and the important elements of work." (Craig v. County of Los Angeles (9th Cir. 1980) 626 F.2d 659, 664, fn. omitted (Craig).) "It would be unrealistic to require more than a reasonable measure of job performance." (Bryant v. City of Chicago (7th Cir. 2000) 200 F.3d 1092, 1098-1099 (Bryant).) "What is required is not perfect reliability, but rather a sufficient degree of reliability to justify the use being made of the test results." (Guardians Assn. of New York City Police Dept., Inc. v. Civil Service Commission of City of New York (2d Cir. 1980) 630 F.2d 79, 101 (Guardians).)

Plaintiffs highlight three cases involving employment tests that disparately impacted a protected group. In City and County of San Francisco v. Fair Employment & Housing Com. (1987) 191 Cal.App.3d 976 (San Francisco v. FEHC), the Court of Appeal concluded substantial evidence supported the Fair Employment and Housing Commission's conclusion that a promotion exam was not sufficiently job-related. (Id. at pp. 990-991.) This conclusion was based on the evidence that supervision was the primary function of the job, yet the ability to supervise was not tested by the exam, resulting in "an examination process for promotion to [a job] which fails to measure the primary function of that job." (Ibid.) The Court of Appeal also found "no evidence that the [employer] made any attempt to correlate examination scores with satisfactory job performance." (Id. at p. 990.)

In the second case, Guardians, the federal Court of Appeals affirmed the trial court's conclusion that a hiring exam was invalid. (Guardians, supra, 630 F.2d at p. 106.) The court identified "five attributes of an exam with sufficient content validity to be used notwithstanding its disparate racial impact. The first two concern the quality of the test's development: (1) the test-makers must have conducted a suitable job analysis, and (2) they must have used reasonable competence in constructing the test itself. The next three attributes are more in the nature of standards that the test, as produced and used, must be shown to have met. The basic requirement, really the essence of content validation, is (3) that the content of the test must be related to the content of the job. In addition, (4) the content of the test must be representative of the content of the job. Finally, the test must be used with (5) a scoring system that usefully selects from among the applicants those who can better perform the job." (Id. at p. 95.) The court also held that when a test's scores are rank-ordered, it must be "shown that 'a higher score . . . is likely to result in better job performance' " because "[i]f test scores do not vary directly with job performance, ranking the candidates on the basis of their scores will not select better employees." (Id. at p. 100.)

" 'Evidence of the validity of a test or other selection procedure by a content validity study should consist of data showing that the content of the selection procedure is representative of important aspects of performance on the job for which the candidates are to be evaluated.' " (AMAE, supra, 231 F.3d at p. 585, fn. 8.)

In the exam at issue in Guardians, rank ordering was used where "high test scores were closely bunched" to an "extraordinary extent"—"Each score from 94 to 97 was achieved by over 2,000 applicants." (Guardians, supra, 630 F.2d at p. 103.) The court found this rendered the scores an unreliable predictor of job performance: "If the test questions had sufficient differentiating power to produce a somewhat even distribution of scores, or at least to avoid excessive bunching among the high scores, the error of measurement would not have affected the ultimate selection of such a significant portion of the applicants. But when 8,928 applicants, two-thirds of all who passed, are bunched between 94 and 97, the error of measurement makes the use of rank-ordering an extremely unreliable basis for hiring decisions." (Ibid.) It noted, "if an exam lacks reliability to such an extent that results would be significantly inconsistent if the same applicants were to take it again, that is an important indication that the test is not especially useful in measuring their abilities. . . . What is required is not perfect reliability, but rather a sufficient degree of reliability to justify the use being made of the test results." (Id. at p. 101.)

The final case emphasized by Plaintiffs is Bryant, which held employment tests are valid if " 'they are demonstrably a reasonable measure of job performance.' " (Bryant, supra, 200 F.3d at p. 1098.) The court further held that "rank-order promotions can be validated by a substantial showing that (1) the test is job related and representative and (2) the test maker achieved 'an adequate degree of reliability.' " (Id. at p. 1099.)

B. Standard of Review

"In ruling on a motion for JNOV [judgment notwithstanding the verdict], ' "the trial court may not weigh the evidence or judge the credibility of the witnesses, as it may do on a motion for a new trial, but must accept the evidence tending to support the verdict as true, unless on its face it should be inherently incredible. Such order may be granted only when, disregarding conflicting evidence and indulging in every legitimate inference which may be drawn from plaintiff's evidence, the result is no evidence sufficiently substantial to support the verdict. [¶] On an appeal from the judgment notwithstanding the verdict, the appellate court must read the record in the light most advantageous to the plaintiff, resolve all conflicts in his favor, and give him the benefit of all reasonable inferences in support of the original verdict." ' " (Carter v. CB Richard Ellis, Inc. (2004) 122 Cal.App.4th 1313, 1320.)

The City bore the burden of proving the Exam was sufficiently job related; the jury found the City failed to meet that burden. " '[W]here the issue on appeal turns on a failure of proof at trial, the question for a reviewing court becomes whether the evidence compels a finding in favor of [the City] as a matter of law. [Citations.] Specifically, the question becomes whether [the City's] evidence was (1) "uncontradicted and unimpeached" and (2) "of such a character and weight as to leave no room for a judicial determination that it was insufficient to support a finding." ' " (Sonic Manufacturing Technologies, Inc. v. AAE Systems, Inc. (2011) 196 Cal.App.4th 456, 465-466.)

II. Analysis

Plaintiffs contend substantial evidence supports the jury's finding that the City failed to demonstrate the Exam was job related.

We begin by noting what Plaintiffs do not dispute. Plaintiffs do not dispute the job analysis accurately identified the key job knowledge, skills, and abilities. Plaintiffs also do not dispute the Exam tested the candidates on these knowledge, skills, and abilities. Instead, Plaintiffs contend the weighting and scoring of the Exam rendered the results unreliable.

A. Three Alleged "Fatal Flaws" Involving Weighting the Exam

Plaintiffs first argue substantial evidence shows three "fatal flaws" in the Exam involving the weighting of various Exam components.

1. Legal Standard Relating to Weighting Exam Components

Weighting, as the parties and the witnesses use the term, refers to multiplying the score of an exam component by a fixed number in order to increase or decrease the relative value of that component to the overall score. The purpose of weighting is to ensure the score of that component reflects the proportional importance of the skill or knowledge tested by that component. For example, take an exam composed of two parts, with 50 points possible for each part. If the skills tested by each part were equally important, no weighting would be necessary. If, however, one of the parts tested skills that were three times as important as the skills tested by the other part, weighting would be necessary to ensure the points for the former constituted 75% of the total score and the points for the latter constituted 25% of the total score.
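To make the arithmetic concrete, the following Python sketch works through the hypothetical two-part exam described above. The scores and function name are illustrative only, not figures from the record.

    # Illustrative only: a hypothetical two-part exam, 50 points per part,
    # weighted so part 1 contributes 75% of the total and part 2 contributes 25%.
    def weighted_total(part1_raw, part2_raw, w1=0.75, w2=0.25, max_points=50):
        # Convert each raw score to a 0-1 proportion, then apply the weights.
        return 100 * (w1 * part1_raw / max_points + w2 * part2_raw / max_points)

    # A candidate strong on the more important part outscores one with the
    # reverse pattern, even though their raw point totals are identical.
    print(weighted_total(40, 20))  # 70.0
    print(weighted_total(20, 40))  # 50.0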

The parties have not cited any cases involving a job-relatedness challenge based on weighting. However, the City argues cases discussing the requirement that "the content of the test must be representative of the content of the job" (Guardians, supra, 630 F.2d at p. 95) are relevant to this issue. Plaintiffs do not contend otherwise, and we agree with the City that these cases apply.

Guardians' discussion of the representativeness requirement is informative. In that case, the court addressed an "argu[ment] that the requirement that the content of the exam be representative means that all the knowledges, skills, or abilities required for the job be tested for, each in its proper proportion. This is not even theoretically possible, since some of the required capacities cannot be tested for in any valid manner. Even if they could be, the task of identifying every capacity and determining its appropriate proportion is a practical impossibility." (Guardians, supra, 630 F.2d at p. 98.) "More reasonable interpretations of the representativeness requirement are appropriate in light of Title VII's basic purposes. The reason for a requirement that the content of the exam be representative is to prevent either the use of some minor aspect of the job as the basis for the selection procedure or the needless elimination of some significant part of the job's requirements from the selection process entirely; this adds a quantitative element to the qualitative requirement—that the content of the test be related to the content of the job. Thus, it is reasonable to insist that the test measure important aspects of the job, at least those for which appropriate measurement is feasible, but not that it measure all aspects, regardless of significance, in their exact proportions." (Id. at pp. 98-99.)

Guardians also discussed a separate aspect of the relatedness requirement not challenged here, "that the procedure, or methodology, of the test must be similar to the procedures required by the job itself." (Guardians, supra, 630 F.2d at p. 98.)

With this background, Guardians concluded the exam at issue—a police officer hiring exam—"meets these representativeness requirements to an adequate degree. While it did not test for all the skills involved in being a police officer nor adequately test for the human relations skill that the job analysis identified as important, the ones it did measure—memory, the ability to fill out forms, and the ability to apply rules to factual situations—are all significant aspects of entry-level police work. To be sure, this conclusion would have been easier to reach if the City had spelled out the relationship between the abilities that were tested for and the job behaviors that had been identified. But the relationship is sufficiently apparent to indicate that the City was not seizing on minor aspects of the police officer's job as the basis for selection of candidates." (Guardians, supra, 630 F.2d at p. 99.)

The court previously concluded the questions designed to test for an applicant's human relations skill were "primarily a further assessment of a candidate's ability to apply written standards to specific fact situations, and only slightly a measure of his talent for human relations." (Guardians, supra, 630 F.2d at p. 97.)

The job analysis identified 42 job tasks and five abilities necessary to perform these tasks, but "no effort was made to explain the relationship between any of the five abilities and the 42 job tasks from which they were ostensibly derived." (Guardians, supra, 630 F.2d at p. 96.)

In another case, Police Officers for Equal Rights v. City of Columbus (6th Cir. 1990) 916 F.2d 1092 (Police Officers), the employer " 'identif[ied] the tasks involved in the job, . . . rated them according to importance and frequency and in the process identified the most important task categories[,] . . . constructed a test which tested for all or nearly all of the task categories, and emphasized the most important task categories.' " (Police Officers, at pp. 1099-1100.) The plaintiffs "argue[d] that the lieutenant examination does not represent the requirements of the position of lieutenant because the test does not measure attributes in proportion to their importance and frequency of use in the performance of the job." (Id. at p. 1099, fn. omitted.) The court agreed with Guardians that "relatedness does not require precise proportionality," and concluded that "testing for nearly all of the task categories and emphasizing the most important categories" is sufficient to satisfy the relatedness requirement. (Police Officers, at p. 1100; see also Gillespie v. Wisconsin (7th Cir. 1985) 771 F.2d 1035, 1044 [rejecting argument that exam was not job-related because it "did not test all or nearly all skills required," holding that "[t]o be representative for Title VII purposes, an employment test must neither: (1) focus exclusively on a minor aspect of the position; nor (2) fail to test a significant skill required by the position"]; Craig, supra, 626 F.2d at p. 664 ["the employer need not establish a perfect positive correlation between the selection criteria and the important elements of work"].)

2. Determining the Weights of the Fire Scene and Training Components

i. Background

After the job analysis was complete and the Exam was created, the Exam Unit determined the relative weight of the two primary components of the Exam—the Fire Scene Component and the Training Component. During the job analysis, the lieutenants and captains rated each of 17 knowledge and ability clusters to identify their relative importance by distributing 100 points among the 17 clusters. The Exam Unit then reviewed each of the two primary components of the Exam and determined the extent to which that component tested each of the 17 clusters. The Exam Unit used a scale of zero to three to measure the extent to which each component tested each cluster: zero indicated the component did not measure that cluster of knowledge and abilities; one indicated it measured them between 1 and 33%; two indicated it measured them between 34 and 66%; and three indicated it measured them between 67 and 100%. The Exam Unit then used the relative importance of each cluster and the extent to which each component measured each cluster to determine the relative weight of the two components, concluding the Fire Scene Component should account for 61% of a candidate's final score and the Training Component should account for 39% of a candidate's final score.
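The opinion does not set out the Exam Unit's exact formula for combining these ratings. The Python sketch below shows one natural reading, on the assumption that each component's weight is proportional to the sum, over the clusters, of the cluster's importance rating times the component's zero to three coverage rating; the cluster names and all numbers are invented for illustration.

    # Assumption, not the record: weight each component in proportion to
    # sum(cluster importance x 0-3 coverage rating). All numbers invented.
    importance = {"fire science": 10, "equipment": 8, "management": 6}  # of 100 points
    coverage = {  # 0-3 rating of how fully each component tests each cluster
        "fire scene": {"fire science": 3, "equipment": 3, "management": 1},
        "training":   {"fire science": 0, "equipment": 1, "management": 3},
    }

    raw = {comp: sum(importance[c] * r for c, r in ratings.items())
           for comp, ratings in coverage.items()}
    total = sum(raw.values())
    weights = {comp: round(v / total, 2) for comp, v in raw.items()}
    print(weights)  # {'fire scene': 0.7, 'training': 0.3}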

Plaintiff's expert, Sheldon Zedeck, criticized the use of a zero to three scale to measure the extent to which each component tested each cluster of knowledge and abilities: "by using a zero, one, two and three, which leads you to a certain outcome, I think it's possible that you have an inappropriate weighting of 69 -- .69 and .31." Zedeck further testified, "We don't have industry standards [for the level of precision required]. We have procedures that are best practices. My conclusion is using weights of zero, one, two and three is too gross and is not a best practice." Zedeck opined this was a "fatal flaw": "I cannot rely on the scores obtained on this examination to make decisions" because if the weights of the two primary components were different, "then all of our scores would be different. There would be a different ranking and you might be choosing different people."

ii. Analysis

Plaintiffs' underlying contention is that, because the method of determining the relative weights of the Fire Scene and Training Components was too imprecise, the Exam scores may not have accurately reflected the precise proportional importance of the knowledge and skills to the lieutenant position. We credit, as we must, Zedeck's testimony that the method of determining these weights was too imprecise to accurately reflect this proportional importance.

However, Plaintiffs do not contend the Exam focused exclusively on a minor aspect of the lieutenant position or failed to test a significant skill required by the position. (Cf. San Francisco v. FEHC, supra, 191 Cal.App.3d at p. 991 [substantial evidence supported finding that promotion exam was not job related where the "examination process . . . fails to measure the primary function of that job"].) Plaintiffs also do not dispute that the weights assigned—61% for the Fire Scene Component and 39% for the Training Component—roughly correspond with the proportional importance of the knowledge and skills tested in those components. They do not contend the Exam Unit was unqualified to measure the extent to which each exam component tested each cluster, or that any of its zero to three measurements were inaccurate. In other words, it is undisputed that the Exam tested for all significant job skills and emphasized the most important skills. As a matter of law, this is sufficient to satisfy the representativeness requirement. (Police Officers, supra, 916 F.2d at p. 1100 ["Appellants argue that testing for nearly all of the task categories and emphasizing the most important categories is insufficient. We disagree."]; Guardians, supra, 630 F.2d at pp. 98-99 ["it is reasonable to insist that the test measure important aspects of the job, . . . but not that it measure all aspects . . . in their exact proportions"].)

We note that Plaintiffs' argument implicitly suggests it is possible to objectively measure the relative importance of each job skill and the extent to which a test component measures each such skill. To the contrary, these are inherently subjective measurements. "[T]he science of testing is not as precise as physics or chemistry, nor its conclusions as provable." (Guardians, supra, 630 F.2d at p. 89.) Accordingly, "the task of identifying every capacity and determining its appropriate proportion is a practical impossibility." (Id. at p. 98.) The law does not hold employers to an impossible standard. An exam that tests all significant job skills in approximate proportion to their relative importance is sufficient.

Zedeck's opinion that the approximate weighting constituted a "fatal flaw" invalidating the exam is not substantial evidence supporting the jury verdict because it rests on the incorrect assumption that further precision is required by law. (Wise v. DLA Piper LLP (US) (2013) 220 Cal.App.4th 1180, 1192 [" '[A]n expert's opinion that assumes an incorrect legal theory cannot constitute substantial evidence.' "].)

3. Determining the Weights of Exam "Dimensions"

i. Background

As noted above, scoring keys for the Fire Scene Component were developed by the Scoring Key Committee, a five-member committee composed of Fire Department members with the rank of Battalion Chief or above. For many of the questions, the Scoring Key Committee identified more than one correct answer; each possible answer was called a "response anchor." For example, in each of the four fire scenarios presented by the Fire Scene Component, one of the questions was: "Would you request additional resources? If so, what additional resources do you request?" For one of the scenarios, the scoring key might list the response anchors for that question as: "Full box," "2nd alarm," "PG&E - high voltage," "DPT/PD - traffic control," and "Water dept." The Scoring Key Committee assigned point values of one through five for each response anchor, to indicate its importance in that scenario.

On the scoring key, these anchors (or answers) were grouped into "dimensions." Each dimension, or group of anchors, represented one of the test questions. As explained in the Validation Report, "The responses to each test question constituted a 'dimension' on the scoring key." After rating the relative importance of each response anchor within a dimension, the Scoring Key Committee also rated the relative importance of each dimension within a scenario. As an Exam Unit staff member testified, "For example, in [the dimension] Communications, there may be 12 items. So it's a minimum of 12 points. [The dimension] Apparatus Placement, there may be three possibilities, a candidate can only get one. [¶] So if they're all 1 point, it's a 12-to-1 ratio. And I asked the key developers, 'Okay, does this make sense? Should you be getting 12 times as many points for initial communication as you get for apparatus placement?' [¶] They say, 'No.' [¶] So we look at the dimensions. I ask them to tell me how the dimensions should relate to each other with regard to value."

This dimension represents the exam question, "What initial report do you give to the Division of Emergency Communications?"

This dimension represents the exam question, "Where would you position your apparatus?"
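The rescaling idea in the staff member's testimony can be sketched in Python as follows. The committee's actual dimension weights are not in the record, so the 3-to-1 ratio below is invented purely to show the mechanics.

    # Illustrative only: rescale dimension scores so dimensions count in an
    # assigned ratio (here an invented 3-to-1) rather than in the incidental
    # ratio of their raw point maxima (12-to-1 in the testimony's example).
    raw_max = {"communications": 12, "apparatus placement": 1}
    assigned_weight = {"communications": 3, "apparatus placement": 1}  # invented

    def rescaled(dimension, raw_score):
        # Fraction of the dimension's maximum, times its assigned weight.
        return assigned_weight[dimension] * raw_score / raw_max[dimension]

    # A perfect communications answer is now worth 3 points, not 12.
    print(rescaled("communications", 12))      # 3.0
    print(rescaled("apparatus placement", 1))  # 1.0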

With respect to this issue, Zedeck testified: "The weights of the dimensions should come from the job analysis," rather than from the Scoring Key Committee members.

ii. Analysis

Plaintiffs do not dispute the qualifications of the Scoring Key Committee members to determine the relative importance of the dimensions; indeed, Zedeck testified the Scoring Key Committee members were well qualified to assign importance to the individual response anchors. Plaintiffs' sole contention is that the weights of the dimensions should have come from the job analysis, rather than from the Scoring Key Committee members.

Zedeck testified, "[t]he scoring key committee, which is composed of I believe people at the captain and above rank, the ones who supervise the lieutenant, we assume they know the job and they have responsibility for supervision. So they are the ones who would be determining the most appropriate answers."

We note Zedeck did not testify that using weights assigned by Scoring Key Committee members rather than the job analysis rendered the Exam unreliable. In their briefs and at oral argument, Plaintiffs point to his testimony that "changing weights of keys is a fatal flaw," but Plaintiffs are not challenging the practice of changing the dimension weights; instead, they challenge the source of those weights. Zedeck did not testify the Scoring Key Committee lacked the ability to correctly determine the dimension weights, but rather stated, "I'm not sure why they would do that because that's in the job analysis."

Weighting the dimensions is simply another means of ensuring the Exam reflects the proportional importance of various knowledge and skills required by the lieutenant position. Plaintiffs do not contend the dimension weights focused exclusively on a minor aspect of the lieutenant position or failed to test a significant skill required by the position. They do not claim the Scoring Key Committee was unqualified to determine the relative weights. That the weights could also have come from the job analysis does not render the Exam invalid. (See Guardians, supra, 630 F.2d at pp. 98-99.)

Plaintiffs argue there is evidence that the weights assigned to the dimensions do not precisely correspond with the dimensions' proportional importance. Plaintiffs point to the job analysis weight of 12.2% assigned to written and oral communications. Plaintiffs compare this figure with the Scoring Key Committee assignment of 20 to 29% for communications dimensions in the Fire Scene Component. As an initial matter, we are dubious that this level of mismatch could render the Exam invalid in light of the legal standards regarding proportionality discussed above. In any event, Plaintiffs' analysis assumes every scenario tests each of the 17 clusters. In fact, the Exam Unit determined that not all clusters were tested by the Fire Scene Component; as the Validation Report stated, "The two test components were designed to sample different task clusters associated with the lieutenant position." Therefore, the relative importance of any given cluster to a scenario in the Fire Scene Component would necessarily be greater than its importance relative to all 17 clusters.

4. Failure to Standardize Certain Subcomponent Scores Before Weighting

i. Background

The final aspect of weighting targeted by Plaintiffs is the City's failure to standardize certain portions of the Exam before weighting them. First, the City did not standardize the dimension scores before weighting them. Second, an Exam Unit staff member testified that the score for the post-fire documentation report completed as part of the fourth scenario in the Fire Scene Component was multiplied "by a factor of .15 because the scoring key committee had indicated it should only be worth 15 percent of what the maximum value was." The multiplication of this score by .15 was done before the scores were standardized. Although these subcomponents were not standardized before being weighted, the Fire Scene Component and Training Component scores were standardized before they were weighted.

As explained in an "Examination Manual" issued by the City's Department of Human Resources, "a standard score is the result of converting a raw score to a new value . . . that compares a score to all other scores in its distribution." The manual further discusses the significance of standardization when an exam has different weighted components: "When weighted raw scores from multiple components are combined, the assigned weights of the components may not have the expected contribution to the total score," therefore standardizing component scores before weighting ensures "that the assigned weight of a component is the weight it actually contributes to the total score."

Zedeck testified, "to get the score on the fire scene, the dimensions should have been standardized" before weights were applied. He also testified, with respect to the documentation report, "what you should have done is standardize the documentation score" and then "multiply the standardized documentation score by .15."

ii. Analysis

This concern again implicates the relative importance placed on the job skills tested by the Exam. As explained in the City's Examination Manual, the purpose of standardizing test component scores is so that the individual components retain their relative weight: "When weighted raw scores from multiple components are combined, the assigned weights of the components may not have the expected contribution to the total score. For example, assume that a test consists of Components A and B where Component A is assigned a weight of 65% and Component B is assigned a weight of 35%. If weighted raw scores from Components A and B are simply added together to reach a total score, the effective weight of Component A may not be 65% and the effective weight of Component B may not be 35%." Yet, as discussed above, approximate proportionality is sufficient. (Guardians, supra, 630 F.2d at p. 99.) If the failure to standardize the dimension or documentation scores meant the skills tested by those sections constituted a somewhat greater or smaller portion of the total exam score, that result does not render the Exam invalid as a matter of law.
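The distortion the manual describes can be demonstrated with a short Python sketch; all scores below are invented. Component A carries a 65% weight but a narrow score spread, while Component B carries a 35% weight but a wide spread. Combining raw scores lets B drive the ranking; standardizing first restores A's assigned share.

    # Illustrative only: why standardizing before weighting matters.
    from statistics import mean, pstdev

    a = [83, 82, 81, 80]   # Component A raw scores (weight 65%, narrow spread)
    b = [40, 60, 80, 100]  # Component B raw scores (weight 35%, wide spread)

    def zscores(xs):
        m, s = mean(xs), pstdev(xs)
        return [(x - m) / s for x in xs]

    raw_totals = [0.65 * x + 0.35 * y for x, y in zip(a, b)]
    std_totals = [0.65 * x + 0.35 * y for x, y in zip(zscores(a), zscores(b))]

    # Raw combination: the candidate weakest on A ranks first, because B's
    # wide spread swamps A's assigned 65% weight.
    print(raw_totals)  # [67.95, 74.3, 80.65, 87.0]
    # Standardized combination: the candidate strongest on A ranks first.
    print(std_totals)  # approx. [0.40, 0.13, -0.13, -0.40]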

We note Zedeck did not testify the failure to standardize the dimensions or documentation section before weighting rendered the Exam unreliable. Plaintiffs point to Zedeck's testimony that "the way that I understand the weights have been determined and the process leading to those weights . . . is a fatal flaw," but this testimony was made during a discussion about Zedeck's concern with "the determination of the weights of .61 and .39 and the process leading to those weights," not the weighting of dimensions or the documentation section.

Plaintiffs emphasize a portion of the City's exam manual stating, "it would not be uncommon for 30 to 40 percent of the eligibles on an eligible list to shift rank when the procedure for combining scores changes." However, this potential 30 to 40% shift in rank occurs when weighting for components constituting an entire exam is done before standardization. Plaintiffs do not dispute that the component scores for the Fire Scene Component and the Training Component were standardized before those components were weighted. The scores that were not standardized constituted only smaller portions of the Exam, and it is not reasonable to infer that such a large shift could occur from a much smaller failure to standardize. In any event, because precise proportionality is not required, the fact alone that some scores might shift with a different weighting method is not sufficient to rebut evidence of job-relatedness. (Guardians, supra, 630 F.2d at p. 101 ["What is required is not perfect reliability, but rather a sufficient degree of reliability to justify the use being made of the test results."].)

B. Other Errors

In addition to the three "fatal flaws" discussed above, Plaintiffs contend the Exam "was riddled with other errors." Plaintiffs concede Zedeck did not testify that each such error invalidated the Exam, but they note his testimony that certain concerns "would give me pause" and "[a]s you start adding up all these pauses, then I have concerns about the inferences."

1. Scoring Errors

Plaintiffs first point to errors in scoring the Exam and argue these errors rendered the final scores unreliable.

i. Background

While processing the Exam scores, the Exam Unit checked for mathematical errors or anomalies. During these checks, the Exam Unit determined that scoring errors had been made by raters. Specifically: (1) one pair of raters had failed to give candidates credit for one answer; and (2) certain scores were not mathematically possible because, for example, the scoring key anchors could only produce scores that were multiples of three, yet the recorded scores were not multiples of three. The Exam Unit corrected these errors, although Plaintiffs disputed some of the corrections. Plaintiffs also presented evidence of 23 additional rating errors not discovered by the Exam Unit.

The scores were also sent to the City's outside consultant, Brull, for an independent review.

The parties both calculated the approximate total number of scoring errors discovered during the exam process and during trial as 150 out of approximately 17,000 total scores on the Fire Scene Component.

ii. Analysis

To be found content valid, an exam must employ "a scoring system that usefully selects from among the applicants those who can better perform the job." (Guardians, supra, 630 F.2d at p. 95.) The law does not require perfection, as "some error of measurement is inevitable." (Id. at p. 104.) Instead, the employer must demonstrate " 'an adequate degree of reliability.' " (Bryant, supra, 200 F.3d at p. 1099.)

As the City's expert and the trial court noted, the error rate for Exam scores was less than 1% (roughly 150 errors among approximately 17,000 scores, or about 0.9%). We conclude that such a small error rate—with many of the errors caught and corrected during the scoring process—constitutes as a matter of law " 'an adequate degree of reliability.' " (Bryant, supra, 200 F.3d at p. 1099.) We note Zedeck did not testify the error rate rendered the Exam unreliable and declined to identify any quantity or rate of scoring error that would invalidate an exam, while the City's expert, Brull, testified this rate of error was within professional standards.

Plaintiffs contend the error rate is more than 1% because of the three alleged weighting errors discussed in part II.A, ante. As discussed above, the City's weighting of various exam components falls well within the legal requirements. Any scoring changes that would result with different weighting, therefore, do not constitute scoring errors.

Plaintiffs argue it was inappropriate for the Exam Unit to change scores without consulting the raters, but this argument does not address the extremely small number of such changes relative to the total number of scores.

Plaintiffs argue there were likely substantial additional scoring errors which cannot be discovered because the City destroyed the raters' scoring keys. We agree with the trial court that this is not a reasonable inference. Although the raters' scoring keys were destroyed (in what even the City's expert conceded was not normal protocol), scoring sheets recording each rater's total score for each dimension were preserved, as were the candidates' exam responses. Checking the accuracy of scores was therefore not, as Plaintiffs suggest, impossible. In any event, most of the identified errors were discovered during the scoring process. While it is a reasonable inference that there may have been a small additional number of errors, it is not reasonable to infer that there were substantially more errors in scoring.

2. Scoring Key

Plaintiffs argue there was evidence the Fire Scene Component scoring key was incorrect. They point to testimony by two Scoring Key Committee members that they did not recommend certain anchors appearing on the final scoring key.

Plaintiffs claim one witness did not recommend 15% of the anchors and the other did not recommend 18% of the anchors; the City's figures are slightly lower (approximately 10% and 15%, respectively). The precise numbers are not relevant to our analysis.

The undisputed testimony—from witnesses presented by Plaintiffs and the City—was that the five Scoring Key Committee members often disagreed about which anchors should be included in the scoring key and reached decisions by majority agreement and/or referring to Fire Department manuals. That some Scoring Key Committee members disagreed with some of the anchors does not render the Exam invalid, nor does it support an inference that the key used to rate exams was not the one developed by the Scoring Key Committee.

3. Rater Consensus

In scoring the exams, two raters independently evaluated each candidate's responses and then conferred to reach a single consensus rating on each dimension. Zedeck testified this process raised a concern because "sometimes there's a more powerful, influential rater in terms of style and personality, etcetera. And you might find a particular rater giving in more often . . . . so it's a possibility you could have undue influence." There was no evidence that any undue influence among rater pairs in fact occurred. Zedeck's testimony that such undue influence was "a possibility" does not give rise to a reasonable inference that any undue influence in fact occurred. Moreover, Zedeck testified the process of having raters reach consensus does "[n]ot by itself" invalidate the Exam.

C. Conclusion

The City conducted a thorough job analysis and designed an exam to test the important components of the lieutenant position. The Exam neither focused exclusively on a minor aspect of the position nor failed to test a required significant skill. The City relied on knowledgeable experts to create a scoring key, recruited outside raters and trained them to score the exams, and attempted to discover and correct scoring errors.

Crediting Plaintiffs' evidence and all reasonable inferences to be drawn therefrom, there were areas in which the Exam could have been improved. But the law does not require perfection. We agree with the trial court that a verdict for the City on job-relatedness was compelled as a matter of law.

DISPOSITION

The judgment is affirmed. Respondents are awarded their costs on appeal.

/s/_________

SIMONS, J. We concur. /s/_________
JONES, P.J. /s/_________
NEEDHAM, J.

