Glossary

Basic Measurement Concepts

Ability: A characteristic indicative of an individual’s competence in a particular field. The word "ability" is frequently used interchangeably with "aptitude," although many psychologists use "ability" to include what others term "aptitude" and "achievement." (See Aptitude.)

Academic Aptitude (See Scholastic Aptitude.)

Achievement/Ability Comparison (AAC): The relationship between an individual’s score on a subtest of the Stanford Achievement Test Series or the Metropolitan Achievement Tests and the scores of other students of similar ability as measured by the Otis-Lennon School Ability Test. If a student’s achievement test score is higher than those of students of similar ability, the AAC is HIGH. If the achievement score is about the same as the scores of similar-ability students, the AAC is MIDDLE; if the score is lower, the AAC is LOW.

Age Norms: The distribution of test scores by age of test takers. For example, a norms table may be provided for 9-year-olds. This age-norms table would present such information as the percentage of 9-year-olds who score below each raw score on the test. (See Norms.)

Anecdotal Data: Data obtained from a written description of a specific incident in an individual’s behavior (an anecdotal record). The written report should be an objective account of behavior considered significant for the understanding of the individual.

Aptitude: A combination of characteristics, whether native or acquired, that are indicative of an individual’s ability to learn or to develop proficiency in some particular area if appropriate education or training is provided. Aptitude tests include those of general academic (scholastic) ability; those of special abilities, such as verbal, numerical, mechanical, or musical; tests assessing "readiness" for learning; and tests that measure both ability and previous learning, and are used to predict future performance—usually in a specific field, such as foreign language, shorthand, or nursing.

Calibrated Difficulty Level: A scale value that expresses how difficult a test item is. This value differs from the conventional difficulty index. (See Difficulty Index.) The origin of the scale is arbitrary, but the lower the value, the easier the item.

Construct Validity (See Validity.)

Content Validity (See Validity.)

Correlation: The degree of relationship between two sets of scores. A correlation of 0.00 denotes a complete absence of relationship. A correlation of plus or minus 1.00 indicates a perfect (positive or negative) relationship. Correlation coefficients are used in estimating test reliability and validity.
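As a rough illustration of how such a coefficient can be computed, the Python sketch below uses the Pearson product-moment formula. This is an assumption: the glossary does not say which coefficient it has in mind, and the function name and sample scores are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviation scores
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores of five students on two forms of a test
form_a = [10, 12, 15, 18, 20]
form_b = [11, 13, 14, 19, 21]
print(round(pearson_r(form_a, form_b), 2))  # 0.98
```

A value this close to 1.00 would suggest the two forms rank the students almost identically.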

Criterion-Referenced (Content-Referenced) Test: Terms often used to describe tests that are designed to provide information about the specific knowledge or skills possessed by a student. Such tests usually cover relatively small units of content and are closely related to instruction. Their scores have meaning in terms of what the student knows or can do, rather than in (or in addition to) their relation to the scores made by some norm group. Frequently, the meaning is given in terms of a cutoff score: people who score at or above that point are considered to have scored adequately ("mastered" the material), while those who score below it are thought to have inadequate scores.

Criterion-Related Validity (See Validity.)

Cumulative Percent (See Percentile Rank.)

Deviation IQ (DIQ): An age-based index of general mental ability. It is based on the difference between a person’s score and the average score for persons of the same chronological age. Deviation IQ scores from most current scholastic aptitude tests are standard scores with a mean of 100 and a standard deviation of 15 or 16 for each defined age group. Thus, the DIQ is a transformed score equal to 15z + 100 (or 16z + 100). (See z-Score and Standard Score.) Some people are moving away from calling such a score on a mental or scholastic ability test an IQ. The Otis-Lennon School Ability Test, for example, reports a School Ability Index. (See School Ability Index.)

Deviation Score (x): The score for an individual minus the mean score for the group; i.e., the amount a person deviates from the mean.

Diagnostic Test: A test used to "diagnose" or analyze; that is, to locate an individual’s specific areas of weakness or strength, to determine the nature of his or her weaknesses or deficiencies, and, if possible, to suggest their cause. Such a test yields measures of the components or subparts of some larger body of information or skill. Diagnostic achievement tests are most commonly prepared for the skill subjects.

Difference Score: Difference between two scores for the same individual.

Difference Score Reliability: Reliability of the distribution of differences between two sets of scores. These scores could be on two different subtests, or on a pre- and posttest, where the difference score is typically called a gain score. The meaning of the term "reliability" is the same for a set of difference scores as for a distribution of regular test scores. (See Reliability.) However, since difference scores are derived from two somewhat unreliable scores, difference scores are often quite unreliable. This must be kept in mind when interpreting difference scores.
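To see why difference scores tend to be unreliable, it may help to work through the classical formula for the reliability of a difference between two scores. The formula is not given in this glossary; the sketch below assumes the two tests have equal standard deviations, and the function name and input values are hypothetical.

```python
def difference_score_reliability(r_xx, r_yy, r_xy):
    """Reliability of the difference X - Y under classical test theory.

    Assumes X and Y have equal standard deviations.
    r_xx, r_yy: reliability coefficients of the two tests
    r_xy: correlation between scores on the two tests
    """
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)

# Two tests that are each quite reliable (.90) but correlate .80
print(round(difference_score_reliability(0.90, 0.90, 0.80), 2))  # 0.5
```

Under these assumptions, two tests with reliabilities of .90 yield difference scores with a reliability of only about .50, because much of what the two tests share cancels out in the subtraction.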

Difficulty Index: The percent of students who answer an item correctly, designated as p. (At times defined as the percent who respond incorrectly, designated as q.)

Discrimination Index: The extent to which an item differentiates between high-scoring and low-scoring examinees. Discrimination indices generally can range from -1.00 to +1.00. Other things being equal, the higher the discrimination index, the better the item is considered to be. Items with negative discrimination indices are generally items in need of rewriting.
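The two item statistics can be sketched in a few lines of Python. The glossary does not say how the discrimination index is computed; the version below assumes the common upper-lower method (proportion correct in the top-scoring group minus proportion correct in the bottom-scoring group, here the top and bottom 27%). All names and data are hypothetical.

```python
def difficulty_index(item_responses):
    """Proportion answering the item correctly (p); responses are 1 or 0."""
    return sum(item_responses) / len(item_responses)

def discrimination_index(item_responses, total_scores, fraction=0.27):
    """Upper-lower discrimination: p in the top group minus p in the bottom group."""
    n = len(total_scores)
    k = max(1, round(n * fraction))          # size of each extreme group
    order = sorted(range(n), key=lambda i: total_scores[i])
    low, high = order[:k], order[-k:]        # lowest and highest total scorers
    p_high = sum(item_responses[i] for i in high) / k
    p_low = sum(item_responses[i] for i in low) / k
    return p_high - p_low

# Ten examinees: 1 = answered this item correctly, plus their total test scores
item = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
totals = [38, 35, 33, 30, 28, 25, 22, 20, 18, 15]
print(difficulty_index(item))                          # 0.6
print(round(discrimination_index(item, totals), 2))    # 0.33
```

A positive value indicates that high scorers answered the item correctly more often than low scorers did.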

Distracters: The incorrect choices in a multiple-choice or matching item (also called foils).

Equivalent Forms: Any of two or more forms of a test that are closely parallel with respect to content and the number and difficulty of the items included. Equivalent forms should also yield very similar average scores and measures of variability for a given group. Also called parallel or alternate forms.

Error of Measurement: The amount by which the score actually received (an observed score) differs from a hypothetical true score. (See also Standard Error of Measurement.)


Frequency: The number of times a given score (or a set of scores in an interval grouping) occurs in a distribution.

Frequency Distribution: A tabulation of scores from low to high or high to low showing the number of individuals who obtain each score or fall within each score interval.


Gain Score: Difference between a posttest score and a pretest score.

Grade Equivalent (G.E.): A norm-referenced score; the grade and month of the school year for which a given score is the actual or estimated average. A grade equivalent is based on a 10-month school year. If a student scores at the average of all fifth graders tested in the first month of the school year, he/she would obtain a G.E. of 5.1. If the score was the same as the average for all fifth graders tested in the eighth month, the grade equivalent would be 5.8. There are some problems with the use of grade equivalents, and caution should be used when interpreting this type of score. For example, if a student at the end of fourth grade obtains a G.E. of 8.8 on a math subtest, this does not mean that the child can do eighth-grade work. Rather, it means that the child obtained the same score as an average student in the eighth month of the eighth grade, had the eighth-grade student taken the fourth-grade test.

Grade Norms: The distribution of test scores by the grade of the test takers. (See Age Norms and Norms.)


Item Analysis: The process of examining students’ responses to test items to judge the quality of each item. The difficulty and discrimination indices are frequently used in this process. (See Difficulty Index and Discrimination Index.)

Item Difficulty (See Difficulty Index.)

Item Discrimination (See Discrimination Index.)


Latent-Trait Scale: A scaled score obtained through one of several mathematical approaches collectively known as Latent-Trait procedures or Item Response Theory. The particular numerical values used in the scale are arbitrary, but higher scores indicate more knowledgeable people or more difficult items. (See Scaled Score.)

Local Percentile (See Percentile.)


Mastery Level: The cutoff score on a criterion-referenced or mastery test. People who score at or above the cutoff score are considered to have mastered the material; people who score below the cutoff score are considered to be nonmasters. "Mastery" in this sense is an arbitrary judgment. A cutoff score can be determined by several different methods. Each method often results in a different cutoff score.

Mastery Test: A test designed to determine whether a student has mastered a given unit of instruction or a single knowledge or skill; a test giving information on what a student knows, rather than on how his or her performance relates to that of some norm group.

Mean ($\bar{X}$): The arithmetic average of a set of scores. It is found by adding all the scores in the distribution and dividing by the total number of scores.

Median (Md): The middle score in a distribution or set of ranked scores; the point (score) that divides a group into two equal parts; the 50th percentile. Half the scores are below the median, and half are above it.

Mode: The score or value that occurs most frequently in a distribution.


N: The symbol commonly used to represent the number of cases in a group.

National Percentile (See Percentile.)

Normal Curve Equivalents (NCEs): Normalized standard scores with a mean of 50 and a standard deviation of 21.06. (See Standard Score.) The standard deviation of 21.06 was chosen so that NCEs of 1 and 99 are equivalent to percentiles of 1 and 99. There are approximately 11 NCEs to each stanine. (See Stanines.)
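The relationship between percentiles and NCEs can be checked with a short Python sketch. It assumes the usual normalizing step: look up the normal-curve z value for the percentile rank, then rescale to a mean of 50 and a standard deviation of 21.06. The function name is hypothetical.

```python
from statistics import NormalDist

def percentile_to_nce(percentile_rank):
    """Convert a percentile rank (strictly between 0 and 100) to an NCE."""
    z = NormalDist().inv_cdf(percentile_rank / 100)  # normal deviate for this rank
    return 50 + 21.06 * z

for pr in (1, 25, 50, 75, 99):
    print(pr, round(percentile_to_nce(pr), 1))
```

The two scales agree at 1, 50, and 99, but differ in between: a percentile rank of 25, for example, corresponds to an NCE of roughly 36.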

Normal Distribution: A distribution of scores or other measures that in graphic form has a distinctive bell-shaped appearance. In a normal distribution, the measures are distributed symmetrically about the mean. Cases are concentrated near the mean and decrease in frequency, according to a precise mathematical equation, the farther one departs from the mean. The assumption that many mental and psychological characteristics are distributed normally has been very useful in test development work.

Figure 1 below is a normal distribution. It shows the percentage of cases between different scores as expressed in standard deviation units. For example, about 34% of the scores fall between the mean and one standard deviation above the mean.


Figure 1. A Normal Distribution.

Norms: The distribution of test scores of some specified group called the norm group. For example, this may be a national sample of all fourth graders, a national sample of all fourth-grade males, or perhaps all fourth graders in some local district.

Norms vs. Standards: Norms are not standards. Norms are indicators of what students of similar characteristics did when confronted with the same test items as those taken by students in the norms group. Standards, on the other hand, are arbitrary judgments of what students should be able to do, given a set of test items.

Norm-Referenced Test: Any test in which the score acquires additional meaning by comparing it to the scores of people in an identified norm group. A test can be both norm- and criterion-referenced. Most standardized achievement tests are referred to as norm-referenced.


Objectives: Stated, desirable outcomes of education.

Out-of-Level Testing: The activity of administering a test level that is different from the one designated for a student of a particular age or in a particular grade. For example, a fourth grader might be given a test level designated for use in Grade 2. Out-of-level testing is used so that students can be tested on the content appropriate to their current level of functioning; that is, above or below their grade placement or age.


p-Value: The proportion of people in an identified norm group who answer a test item correctly; usually referred to as the difficulty index. (See Difficulty Index.)

Percentile: A point on the norms distribution below which a certain percentage of the scores fall. For example, if 70% of the scores fall below a raw score of 56, then the score of 56 is at the 70th percentile. The term "local percentile" indicates that the norm group is obtained locally. The term "national percentile" indicates that the norm group represents a national group.

Percentile Band: An interpretation of a test score that takes into account measurement error. These bands, which are most useful in portraying significant differences between subtests in battery profiles, most often represent the range from one standard error of measurement below the obtained score to one standard error of measurement above it. For example, if a student had a raw score of 35, and if the standard error of measurement were 5, the percentile rank for a score of 30 to the percentile rank for a score of 40 would be the percentile band. We would be 68% confident the student’s true percentile rank falls within this band. (See Standard Error of Measurement and True Score.)

Percentile Rank: The percentage of scores falling below a certain point on a score distribution. (Percentile and percentile rank are sometimes used interchangeably.)
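A minimal Python sketch of percentile ranks and percentile bands follows, using the "percentage of scores falling below" definition given here. The norm group is invented for illustration, and the function names are hypothetical; published norms tables are built from much larger samples and may use slightly different conventions for tied scores.

```python
def percentile_rank(norm_scores, score):
    """Percentage of norm-group scores falling below `score`."""
    below = sum(1 for s in norm_scores if s < score)
    return 100 * below / len(norm_scores)

def percentile_band(norm_scores, score, sem):
    """Percentile ranks one SEM below and one SEM above the obtained score."""
    return (percentile_rank(norm_scores, score - sem),
            percentile_rank(norm_scores, score + sem))

norm_group = list(range(20, 50))   # hypothetical norm group: raw scores 20 to 49
print(percentile_rank(norm_group, 35))                          # 50.0
low, high = percentile_band(norm_group, 35, sem=5)
print(round(low), round(high))                                  # 33 67
```

For a raw score of 35 and a standard error of measurement of 5, the band runs from the percentile rank of a score of 30 to that of a score of 40.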

Profile: A graphic presentation of several scores expressed in comparable units of measurement for an individual or a group. This method of presentation permits easy identification of relative strengths or weaknesses across different tests or subtests.


Quartile: One of three points that divide the scores in a distribution into four groups of equal size. The first quartile ($Q_1$), or 25th percentile, separates the lowest fourth of the group; the middle quartile ($Q_2$), the 50th percentile or median, divides the second fourth of the cases from the third; and the third quartile ($Q_3$), the 75th percentile, separates the top quarter.
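The three quartile points can be computed with Python's standard library, as in the sketch below. Several interpolation conventions exist for small data sets; the "inclusive" method used here is one assumption among them, and the scores are hypothetical.

```python
import statistics

def quartiles(scores):
    """Return the three quartile points (first, middle, third) of a set of scores."""
    q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return q1, q2, q3

scores = [55, 60, 62, 65, 70, 72, 75, 80, 90]
print(quartiles(scores))   # (62.0, 70.0, 75.0)
```

The middle quartile equals the median of the scores, as the definition requires.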


Raw Score: A person’s observed score on a test, i.e., the number correct. While raw scores do have some usefulness, they should not be used to make comparisons between performance on different tests, unless other information about the characteristics of the tests is known. For example, if a student answered 24 items correctly on a reading test, and 40 items correctly on a mathematics test, we should not assume that he or she did better on the mathematics test than on the reading measure. Perhaps the reading test consisted of 35 items and the mathematics test consisted of 80 items. Given this additional information we might conclude that the student did better on the reading test (24/35 as compared with 40/80). How well did the student do in relation to other students who took the test in reading? We cannot address this question until we know how well the class as a whole did on the reading test. Twenty-four items answered correctly may seem impressive, but if the average (mean) score attained by the class was 33, the student’s score of 24 takes on a different meaning.

Readiness Test: A measure of the extent to which an individual has achieved the degree of maturity, or has acquired certain skills or information, needed to undertake some new learning activity successfully. For example, a reading readiness test indicates whether a child has reached a developmental stage at which he or she may profitably begin formal reading instruction.

Regression Effect: Tendency of a posttest score (or a predicted score) to be closer to the mean of its distribution than the pretest score is to the mean of its distribution. Because of the effects of regression, students obtaining extremely high or extremely low scores on a pretest tend to obtain less extreme scores on a second administration of the same test (or on some predicted measure).

Reliability: The extent to which test scores are consistent; the degree to which the test scores are dependable or relatively free from random errors of measurement. Reliability is usually expressed in the form of a reliability coefficient or as the standard error of measurement derived from it. The reliability of a major classroom achievement test should be at least .60. The reliability of a standardized achievement or aptitude test should be at least .85. The higher the reliability coefficient the better, because this means there are smaller random errors in the scores. A test (or a set of test scores) with a reliability of 1.00 would have a standard error of zero and thus be perfectly reliable. (See Standard Error of Measurement.)

Reliability Coefficients: Estimated by the correlation between scores on two equivalent forms of a test, by the correlation between scores on two administrations of the same test, or through procedures known as internal-consistency estimates. Each of the three estimates pertains to a different aspect of reliability. One of the easier and more commonly used (by teachers) estimates of reliability is known as the Kuder-Richardson Formula #21 estimate. The formula is

$$r_{KR21} = \frac{K}{K-1}\left[1 - \frac{\bar{X}\,(K - \bar{X})}{K\,(\text{S.D.})^2}\right]$$

where K is the number of items on the test, $\bar{X}$ is the mean of the test scores, and S.D. is their standard deviation.
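For readers who want to try the Kuder-Richardson Formula #21 estimate on their own classroom data, here is a minimal Python sketch of the formula as it is commonly stated. The function name and the sample values are hypothetical, and textbooks differ on whether the standard deviation is computed with N or N - 1 in the denominator.

```python
def kr21(num_items, mean, sd):
    """Kuder-Richardson Formula 21 reliability estimate.

    r = K/(K-1) * (1 - M*(K-M) / (K * S^2))
    K: number of items, M: mean test score, S: standard deviation of the scores.
    Assumes every item is scored 1 (right) or 0 (wrong).
    """
    k = num_items
    return (k / (k - 1)) * (1 - (mean * (k - mean)) / (k * sd ** 2))

# Hypothetical 50-item test with a class mean of 40 and an S.D. of 5
print(round(kr21(50, 40, 5), 2))   # 0.69
```

Only the number of items, the mean, and the standard deviation are needed, which is why this estimate is convenient for teachers.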



Reliability of Difference Scores (See Difference Score Reliability.)


Scaled Score: A mathematical transformation of a raw score. Scaled scores are useful when comparing test results over time. Most standardized achievement test batteries provide scaled scores for such purposes. Several different methods of scaling exist, but each is intended to provide a continuous score scale across the different forms and levels of a test series.

Scaled-Score Band: An individual’s scaled score plus and minus one standard error of measurement on the scaled-score metric. We can be 68% confident that the person’s true scaled score is between the two end points of this band. (See Standard Error of Measurement and True Score.)

Scholastic Aptitude: The combination of native and acquired abilities that are needed for school learning; the likelihood of success in mastering academic work as estimated from measures of the necessary abilities.

School Ability Index (SAI): A normalized standard score, obtained from the Otis-Lennon School Ability Test, with a mean of 100 and a standard deviation of 16. (See Deviation IQ and Standard Score.) For example, an individual who had a School Ability Index of 116 would be one standard deviation above the mean. This person would be at the 84th percentile for his or her age group. (See Normal Distribution.)

Standard Age Scores: Normalized standard scores provided for specified age groups on each battery of a test. Typically, standard age scores have a mean of 100 and a standard deviation of 15.

Standard Deviation (S.D.): A measure of the variability, or dispersion, of a distribution of scores. The more the scores cluster around the mean, the smaller the standard deviation. In a normal distribution of scores, 68.3% of the scores are within the range of one S.D. below the mean to one S.D. above the mean. Computation of the S.D. is based upon the square of the deviation of each score from the mean. One way of writing the formula is as follows:

$$\text{S.D.} = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$$

where X is an individual score, $\bar{X}$ is the mean of the scores, and N is the number of scores.



(See Normal Distribution.)
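The computation can be followed step by step in a short Python sketch: find the mean, square each score's deviation from it, average the squares, and take the square root. This version divides by N, the number of scores; some texts divide by N - 1 instead. The function name and scores are hypothetical.

```python
import math

def standard_deviation(scores):
    """Standard deviation based on squared deviations from the mean (divides by N)."""
    n = len(scores)
    mean = sum(scores) / n
    squared_deviations = [(x - mean) ** 2 for x in scores]
    return math.sqrt(sum(squared_deviations) / n)

scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(standard_deviation(scores))   # 2.0
```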

Standard Error of Measurement (SEM): The amount an observed score is expected to fluctuate around the true score. For example, the obtained score will not differ by more than plus or minus one standard error from the true score about 68% of the time. About 95% of the time, the obtained score will differ by less than plus or minus two standard errors from the true score.


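The SEM is frequently used to obtain an idea of the consistency of a person’s score or to set a band around a score. Suppose a person scores 110 on a test where the S.D. = 20 and the reliability coefficient $r_{xx}$ = .91. Then:

$$\text{SEM} = \text{S.D.}\sqrt{1 - r_{xx}} = 20\sqrt{1 - .91} = 20(.3) = 6$$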


We would thus say we are 68% confident the person’s true score was between (110–1 SEM) and (110+1 SEM) or between 104 and 116.
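The same arithmetic can be reproduced in Python. The sketch assumes the standard relationship SEM = S.D. × √(1 − reliability), which is consistent with the band of 104 to 116 stated above; the function names are hypothetical.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM from the standard deviation and the reliability coefficient."""
    return sd * math.sqrt(1 - reliability)

def score_band(observed, sd, reliability, n_sems=1):
    """Band of n_sems standard errors on either side of the observed score."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - n_sems * sem, observed + n_sems * sem

sem = standard_error_of_measurement(20, 0.91)
low, high = score_band(110, 20, 0.91)
print(round(sem, 2), round(low), round(high))   # 6.0 104 116
```

Passing `n_sems=2` gives the wider band (98 to 122 in this example) associated with about 95% confidence.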

Standard Score: A general term referring to scores that have been "transformed" for reasons of convenience, comparability, ease of interpretation, etc. The basic type of standard score, known as a z-score, is an expression of the deviation of a score from the mean score of the group in relation to the standard deviation of the scores of the group. Most other standard scores are linear transformations of z-scores, with different means and standard deviations. (See z-Score.)

Standards (See Norms vs. Standards.)
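Because most standard scores are linear transformations of z-scores, a single helper is enough to move from z to any of them. The Python sketch below uses the means and standard deviations quoted elsewhere in this glossary for T-scores and deviation IQs; the function name is hypothetical.

```python
def from_z(z, mean, sd):
    """Linear transformation of a z-score to a scale with the given mean and S.D."""
    return mean + sd * z

z = 1.0   # one standard deviation above the mean
print(from_z(z, 50, 10))    # T-score: 60.0
print(from_z(z, 100, 15))   # deviation IQ with S.D. 15: 115.0
print(from_z(z, 100, 16))   # deviation IQ with S.D. 16: 116.0
```

Normalized scores such as stanines and NCEs involve an additional step (conversion through percentile ranks) and are not simple linear transformations of raw-score z values.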

Stanines: A nine-point normalized standard score scale with a mean of 5 and a standard deviation of 2. Only the integers 1 to 9 occur. The percentage of scores at each stanine is 4, 7, 12, 17, 20, 17, 12, 7, and 4, respectively. While stanines are popular, they are actually less informative than, say, percentiles. For example, for three students with percentiles of 39, 41, and 59, the first would receive a stanine of 4, and the next two stanines of 5. We would thus be misled into inferring that the latter two students were the same, and different from the first with respect to the characteristic measured, whereas in reality the first two individuals are essentially the same, and different from the third.

Sometimes, the first three stanines are interpreted as being "below average," the next three as "average," and the top three stanines as "above average." This can be quite misleading. Suppose twins, Joe and Jim, have percentiles of 22 and 24, respectively. Joe would have a stanine of 3 and be considered "below average," whereas Jim would have a stanine of 4 and be considered "average."
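The stanine examples can be checked with a small Python lookup. The cut points are the running totals of the percentages listed above (4, 11, 23, 40, 60, 77, 89, 96); treating a percentile rank that falls exactly on a cut point as belonging to the higher stanine is an assumption, as are the names used.

```python
from bisect import bisect_right

# Lowest percentile rank in stanines 2 through 9, from the cumulative
# percentages 4, 11, 23, 40, 60, 77, 89, 96.
STANINE_CUTS = [4, 11, 23, 40, 60, 77, 89, 96]

def percentile_to_stanine(percentile_rank):
    """Map a percentile rank (1 to 99) to a stanine (1 to 9)."""
    return bisect_right(STANINE_CUTS, percentile_rank) + 1

for pr in (22, 24, 39, 41, 59):
    print(pr, percentile_to_stanine(pr))
# 22 -> 3, 24 -> 4, 39 -> 4, 41 -> 5, 59 -> 5
```

The output shows how percentile ranks only two points apart can land in different stanines, while ranks 18 points apart can share one.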


T-Score: A standard score with a mean of 50 and a standard deviation of 10. Thus a T-score of 60 represents a score one standard deviation above the mean. T-scores are obtained by the following formula:

$$T = 10z + 50$$


True Score: A score entirely free of error; a hypothetical value that can never be obtained by testing, since a test score always involves some measurement error. A person’s "true" score may be thought of as the average of an infinite number of measurements from the same or exactly equivalent tests, assuming no practice effect or change in the examinee during the testing. The standard deviation of this infinite number of scores is known as the standard error of measurement. (See Standard Error of Measurement.)


Validity: The extent to which a test does the job for which it is intended. The term validity has different connotations for different types of tests and, therefore, different kinds of validity evidence are appropriate for each.

1. Content validity: For achievement tests, content validity is the extent to which the content of the test represents a balanced and adequate sampling of the outcomes (domain) about which inferences are to be made.

Typically, but not always, we wish to make inferences about the degree to which students have learned the material in a course. In those cases, the question of content validity is a question of the match and balance between the test items and the course content. At other times we wish to make different inferences. For example, we may wish to know (make inferences about) how well a group of students can perform the basic arithmetic functions even though we have not been teaching them directly but have been teaching set theory, different number bases, exponents, etc. In such a case, the content validity of a test would be the degree to which the test questions represent a balanced and adequate sampling of the domain of "arithmetic functions." The match is always between the questions asked and the domain of behavior about which inferences are to be made.


2. Criterion-related validity: The extent to which scores on the test are in agreement with (concurrent validity) or predict (predictive validity) some criterion measure.

Predictive validity refers to the accuracy with which a test is indicative of performance on a future criterion measure, e.g., the relationship of scores on an academic aptitude test administered in high school to grade-point averages over four years of college. Evidence of concurrent validity is obtained when no time interval has elapsed between the administration of the test being validated and the collection of criterion data. Concurrent validity might be obtained by administering concurrent measures of academic ability and achievement, by determining the relationship between a new test and one generally accepted as valid, or by determining the relationship between scores on a test and a less objective criterion measure.


3. Construct validity: The extent to which a test measures some relatively abstract psychological trait or construct; applicable in evaluating the validity of tests that have been constructed on the basis of an analysis of the trait and its manifestation.

Tests of personality, verbal ability, mechanical aptitude, critical thinking, etc., are validated in terms of their constructs by the relationships between their scores and pertinent external data.


Variability: The spread or dispersion of test scores, most often expressed as a standard deviation. (See Standard Deviation.)

Variance: The square of the standard deviation.


Weighting: The process of assigning different weights to different scores in making some final decision. To do weighting correctly, one must convert all scores to a common scale or metric. For example, we cannot average temperatures measured with both the Celsius and Fahrenheit scale until the temperatures from one scale are converted to the other scale. For educational data, we should first convert all data to a common scale such as a z-score, a T-score, or some other standard score. Then, to combine scores, we must determine how much weight to give each score. Weights are usually assigned subjectively, based on the importance and/or quality, e.g., reliability, of the data.
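A weighted composite along these lines might be sketched in Python as follows: each set of scores is first converted to z-scores, and the z-scores are then combined using the chosen weights. The measures, scores, and weights are hypothetical, and the weights remain a subjective choice.

```python
from statistics import mean, pstdev

def z_scores(scores):
    """Convert a list of scores to z-scores (mean 0, standard deviation 1)."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]

def weighted_composite(score_sets, weights):
    """Weighted average of z-scores for each student.

    score_sets: one list of scores per measure, students in the same order
    weights: one weight per measure
    """
    standardized = [z_scores(s) for s in score_sets]
    total_weight = sum(weights)
    n_students = len(score_sets[0])
    return [sum(w * z[i] for w, z in zip(weights, standardized)) / total_weight
            for i in range(n_students)]

quiz = [8, 6, 10, 4]        # scored out of 10
final = [70, 90, 80, 60]    # scored out of 100
composite = weighted_composite([quiz, final], weights=[1, 3])
print([round(c, 2) for c in composite])
```

Without the conversion to z-scores, the final examination would dominate the composite simply because its scores are numerically larger, regardless of the weights intended.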


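z-Score: A type of standard score with a mean of zero and a standard deviation of one. (See Standard Score.) A z-score is obtained by the following formula:

$$z = \frac{X - \bar{X}}{\text{S.D.}}$$

where X is the individual’s score, $\bar{X}$ is the mean of the group, and S.D. is the standard deviation of the group’s scores.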



Thus, for example, if three individuals have z-scores of -0.5, 0, and 1.0, we know the first scored one-half a standard deviation below the mean, the second scored right at the mean, and the third scored one standard deviation above the mean. If the distribution were normal these z-scores would have percentiles of about 31, 50, and 84, respectively. (See Normal Distribution.)
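The link between z-scores and percentiles under a normal distribution can be verified with Python's standard library. The sketch below is illustrative only; the function names and the sample raw score are hypothetical.

```python
from statistics import NormalDist

def z_score(raw, mean, sd):
    """Deviation of a raw score from the group mean, in standard deviation units."""
    return (raw - mean) / sd

def normal_percentile(z):
    """Percentage of a normal distribution falling below the given z-score."""
    return 100 * NormalDist().cdf(z)

print(z_score(65, 70, 10))             # -0.5
for z in (-0.5, 0.0, 1.0):
    print(z, round(normal_percentile(z)))
# -0.5 -> 31, 0.0 -> 50, 1.0 -> 84
```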
