
Designing Written Assessment of Student Learning

Carlo Magno
Jerome Ouano

Chapter 1
Assessment, Measurement, and Evaluation

Chapter Objectives

1. Describe assessment in the educational and classroom setting.
2. Identify the ways in which assessment is conducted in the educational setting.
3. Explain how assessment is integrated with instruction and learning.
4. Distinguish the critical features of measurement, evaluation, and assessment.
5. Identify the uses of assessment results.

Lessons

1 Assessment in the Classroom Context


2 The Role of Measurement and Evaluation in Assessment
The Nature of Measurement
The Nature of Evaluation
Forms of Evaluation
Models of Evaluation
Examples of Evaluation Studies
3 The Process of Assessment
The Process of Assessment
Forms of Assessment
Components of Classroom Assessment
Paradigm Shifts in the Practice of Assessment
Uses of Assessment

Lesson 1: Assessment in the Classroom Context

To better understand the nature of classroom assessment, it is important to answer three questions: (1) What is assessment? (2) How is assessment conducted? and (3) When is assessment conducted?

[Diagram: Three guiding questions: What is assessment? How is assessment conducted? When is assessment conducted?]

It is customary in the educational setting that at the end of a quarter, trimester, or semester, students receive a grade. The grade reflects a combination of different forms of assessment that both the teacher and the student have conducted. These grades are based on a variety of information that the student and teacher gathered in order to arrive objectively at a value that reflects the student's performance. The grades also serve to measure how well the students have accomplished the learning goals intended for them in a particular subject, course, or training. The process of collecting various kinds of information and combining them into an overall picture of the attainment of goals and purposes is referred to as assessment (the details of this process are explained in the next section). The process of assessment involves other concepts such as measurement, evaluation, and testing (the distinction among these concepts and how they are related is explained in the succeeding section of the book).
The teacher and students use various sources in coming up with an overall assessment of the student's performance. A student's grade that reflects his or her performance is a collective assessment from various sources such as recitation, quizzes, long tests, final exams, projects, final papers, performance assessments, and other sources. Different schools and teachers give certain weights to these criteria depending on their goals for the subject or course. Some schools assign weights based on the nature of the subject area, some teachers base them on the objectives set, and others treat all criteria with equal weights. There is no single ideal set of weights for these criteria because the weights depend on the overall purpose of the learning and teaching process, the orientation of the teachers, and the goals of the school.
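To make the weighting concrete, the short Python sketch below combines hypothetical component scores into a single grade. The component names and weights are illustrative only, since each school and teacher sets its own.

    # Illustrative sketch: combining several assessment sources into one grade.
    # Component names and weights are hypothetical; schools and teachers set their own.
    def weighted_grade(scores, weights):
        """Combine component scores (0-100) using weights that sum to 1.0."""
        assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
        return sum(scores[part] * weights[part] for part in weights)

    scores = {"quizzes": 85, "long_tests": 78, "recitation": 90, "project": 88, "final_exam": 82}
    weights = {"quizzes": 0.20, "long_tests": 0.25, "recitation": 0.15, "project": 0.15, "final_exam": 0.25}
    print(round(weighted_grade(scores, weights), 2))  # 83.7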
An overall assessment should come from a variety of sources so that the information can be used effectively in making decisions about students. For example, in order to promote a student to the next grade or year level, or to move the student to the next course, the information taken about the student's performance should be based on multiple forms of assessment. The student should have been assessed in different areas of performance to support valid decisions such as promotion, identifying the top pupils and honors, or even failure and retention in the current level. These sources come from objective assessments of learning such as several quizzes, a series of recitations, performance assessments in different areas, and feedback. These forms of assessment are generally given in order to determine how well the students can demonstrate a sample of their skills.

Assessment is integrated in all parts of the teaching and learning process. This means that assessment can take place before instruction, during instruction, and after instruction. Before instruction, teachers can use assessment results as the basis for the objectives and activities in their plans. These assessment results come from the achievement tests of students from the previous year, grades of students from the previous year, assessment results from the previous lesson, or pretest results before instruction takes place. Knowing the assessment results from different sources prior to planning the lesson helps teachers design instruction that better fits the kind of learners they will handle, set objectives appropriate for the learners' developmental level, and think of better ways of assessing students to effectively measure the skills learned. During instruction, there are many ways of assessing student performance. While class discussion is conducted, teachers can ask questions and students can answer them orally to assess whether students can recall, understand, apply, analyze, evaluate, and synthesize the facts presented. During instruction, teachers can also provide seatwork and worksheets on every unit of the lesson to determine whether students have mastered the skill needed before moving to the next lesson. Assignments are also provided to reinforce student learning inside the classroom. Assessment done during instruction serves as formative assessment, which is meant to prepare students before they are finally assessed through major exams and tests. When the students are ready to be assessed after instruction has taken place, they are assessed on the variety of skills they were trained in; this serves as a summative form of assessment. Final assessments come in the form of final exams, long tests, and final performance assessments, which cover a larger scope of the lesson and require more complex skills to be demonstrated. Assessments conducted at the end of instruction are more structured and are announced so that students have time to prepare.

Review Questions:

1. What are the other processes involved in assessment?
2. Why should there be several sources of information in order to come up with an overall assessment?
3. What are the different purposes of assessment when conducted before, during, and after instruction?
4. Why is assessment integrated in the teaching and learning process?

Activity #1

Ask a sample of students the following questions:

1. Why do you think assessment is needed in learning?
2. What are the different ways of assessing student learning in the courses you are taking?

Tabulate the answers and present them in class.

Lesson 2
The Role of Measurement and Evaluation in Assessment

The concept of assessment is broad in that it involves other processes such as measurement and evaluation. Assessment involves several measurement processes in order to arrive at quantified results. When assessment results are used to make decisions and come up with judgments, then evaluation takes place.

[Diagram: The relationship of measurement, assessment, and evaluation]

The Nature of Measurement

Measurement is an important part of assessment. Measurement has the features of quantification, abstraction, and openness to further analysis that are typical of the process of science. Some assessment results come in the form of quantitative values that enable further analysis.
Obtaining evidence about different phenomena in the world can be based on measurement. A statement can be accepted as true or false if the event can be directly observed. In the educational setting, before saying that a student is "highly intelligent," there must be observable proof to demonstrate that the student is indeed "highly intelligent." The people involved in identifying whether a student is "highly intelligent" will have to gather accurate information as evidence to support such a claim. When judgments about characteristics such as "intelligence" are based on indicators like a high test score, exemplary performance in cognitive tasks, or high grades, then measurement must have taken place. If measurement is carefully done, then the process meets the requirements of scientific inquiry.
Objects per se are not measured; what are measured are the characteristics or traits of objects. These measurable characteristics or traits are referred to as variables. Examples of
variables that are studied in the educational setting are intelligence, achievement, aptitude,
interest, attitude, temperament, and others.
Nunnally (1970) defined measurement as consisting of "rules for assigning numbers to objects in such a way as to represent quantities of attributes." Measurement is used to quantify characteristics of objects. Quantification of characteristics or attributes has several advantages:

1. Quantifying characteristics or attributes determines the amount of the attribute present. If a student was placed at the 10th percentile rank on an achievement test, it means that the student has achieved less relative to others (the score is equal to or higher than only about 10% of the reference group). A student who gets a perfect score on a quiz about the life of Jose Rizal has remembered enough information about Jose Rizal.
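As a rough illustration (not taken from the text), the sketch below computes a percentile rank under one common convention, the percentage of scores at or below a given score; the class scores are invented.

    # Illustrative sketch: percentile rank as the percentage of scores at or below a given score.
    def percentile_rank(scores, score):
        at_or_below = sum(1 for s in scores if s <= score)
        return 100.0 * at_or_below / len(scores)

    class_scores = [35, 42, 48, 50, 55, 58, 60, 64, 70, 75, 78, 80, 82, 85, 88, 90, 91, 93, 95, 98]
    print(percentile_rank(class_scores, 42))  # 10.0 -> roughly the 10th percentile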

2. Quantification facilitates accurate information. If a student gets a standard score of -2 on a standardized test (standard scores typically range from -3 to +3, where 0 is the mean), it means that the student is below average on that test. If a student gets a stanine score of 8 on a standardized test (stanine scores range from 1 to 9, where 5 is the average), it means that the student is above average or has demonstrated superior ability on the trait measured by the standardized test.
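These conversions can be made explicit. The sketch below assumes a hypothetical test mean of 50 and standard deviation of 6, and uses the common approximation that stanines are half-standard-deviation bands centered on 5.

    # Illustrative sketch: raw score -> standard (z) score -> stanine.
    def z_score(raw, mean, sd):
        return (raw - mean) / sd

    def stanine(z):
        # Common approximation: stanine = round(2z + 5), clipped to the 1-9 range.
        return max(1, min(9, int(round(z * 2 + 5))))

    z = z_score(raw=38, mean=50, sd=6)   # assumed test mean and SD
    print(round(z, 2), stanine(z))       # -2.0 1 -> two SDs below the mean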

3. Quantification allows objective comparison of groups. Suppose that male and female students were tested in their math ability using the same test for both groups, and the mean of the males' math scores is 92.3 while the mean of the females' math scores is 81.4. It can be said that the males performed better on the math test than the females if the difference is tested and found to be statistically significant.
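A minimal sketch of such a significance test is shown below, using an independent-samples t test from scipy on invented score lists (the data are not the groups described above).

    # Illustrative sketch: testing whether two group means differ significantly.
    from scipy import stats

    male_scores = [95, 90, 88, 96, 93, 91, 89, 97]      # invented data
    female_scores = [82, 79, 85, 80, 78, 84, 81, 83]    # invented data

    t, p = stats.ttest_ind(male_scores, female_scores)
    if p < 0.05:
        print(f"t = {t:.2f}, p = {p:.4f}: the difference in means is statistically significant")
    else:
        print(f"t = {t:.2f}, p = {p:.4f}: no significant difference detected")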

4. Quantification allows classification of groups. The common way of categorizing sections or classes is based on students' general average grades from the last school year. This is especially true if there are designated top sections within a level. In the process, students' grades are ranked from highest to lowest and the necessary cut-offs are made depending on the number of students that can be accommodated in a class.
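A minimal sketch of this ranking-and-cut-off procedure is given below; the names, grades, and section size are hypothetical.

    # Illustrative sketch: rank students by general average and cut into sections.
    students = {"Ana": 92.5, "Ben": 88.1, "Cara": 95.0, "Dino": 84.3, "Ella": 90.2, "Fe": 86.7}
    section_size = 3  # assumed number of students per class

    ranked = sorted(students.items(), key=lambda kv: kv[1], reverse=True)
    sections = [ranked[i:i + section_size] for i in range(0, len(ranked), section_size)]
    for label, group in zip("ABC", sections):
        print("Section", label, [name for name, grade in group])
    # Section A ['Cara', 'Ana', 'Ella']
    # Section B ['Ben', 'Fe', 'Dino']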

5. Quantification makes the data amenable to further analysis. When data are quantified, teachers, guidance counselors, researchers, administrators, and other personnel can obtain different results to summarize and make inferences about the data. The data may be presented in charts, graphs, and tables showing means and percentages. The quantified data can be further analyzed using inferential statistics, such as when comparing groups, benchmarking, and assessing the effectiveness of an instructional program.
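As a small illustration of such summaries, the sketch below reports a mean and a percentage for an invented set of quiz scores; the passing mark is assumed.

    # Illustrative sketch: summarizing quiz scores with a mean and a percentage passing.
    import statistics

    scores = [75, 82, 68, 90, 88, 79, 85, 60, 95, 72]   # invented data
    passing = 75                                        # assumed passing mark
    print("Mean:", round(statistics.mean(scores), 1))                                 # 79.4
    print("Percent passing:", 100 * sum(s >= passing for s in scores) / len(scores))  # 70.0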

The process of measurement in the physical sciences (physics, chemistry, biology) is similar to that in education and the social sciences. Both use instruments or tools to arrive at measurement results. The only difference is in the variables of interest being measured. In the physical sciences, measurement is more accurate and precise because of the nature of physical data, which are directly observable, and because the variables involved are tangible to the senses. In education, psychology, and behavioral science, the data are subject to measurement error and large variability because of individual differences and the inability to control variations in the measurement conditions. In education, psychology, and behavioral science, however, there are statistical procedures for estimating measurement error, such as reporting standard deviations, standard errors, and variance.
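A brief sketch of these indices is shown below; the scores are invented, and the reliability coefficient used for the standard error of measurement is assumed.

    # Illustrative sketch: indices of score variability and measurement error.
    import math, statistics

    scores = [70, 74, 78, 80, 83, 85, 88, 90]    # invented data
    sd = statistics.stdev(scores)                # sample standard deviation
    variance = statistics.variance(scores)
    reliability = 0.90                           # assumed reliability coefficient
    sem = sd * math.sqrt(1 - reliability)        # standard error of measurement

    print(round(sd, 2), round(variance, 2), round(sem, 2))   # approx. 6.87 47.14 2.17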
Measurement facilitates objectivity in observation. Through measurement, extreme differences in results are avoided, provided that there is uniformity in conditions and individual differences are controlled. This implies that when two persons measure a variable following the same conditions, they should be able to get consistent results. There may be slight differences (especially if the variable measured is psychological in nature), but the results should be at least consistent. Repeating the measurement process several times and obtaining consistent results indicates the objectivity of the procedure undertaken.
The process of measurement involves abstraction. Before a variable is measured using an instrument, the variable's nature needs to be clarified and studied well. The variable needs to be defined conceptually and operationally in order to identify how it is going to be measured. Knowing the conceptual definition based on several references will show the theory or conceptual framework that fully explains the variable. The framework reveals whether the variable is composed of components or specific factors. These specific factors that comprise the variable then need to be measured. A characteristic that is composed of several factors or components is called a latent variable. The components are usually called factors, subscales, or manifest variables. An example of a latent variable is "achievement." Achievement is composed of factors that include different subject areas in school such as math, general science, English, and social studies. Once the variable is defined and its underlying factors are identified, the appropriate instrument that can measure the achievement can be selected. When the instrument or measure for achievement is selected, it becomes easy to operationally define the variable. An operational definition includes the procedures on how a variable will be measured or made to occur. For example, "achievement" can be operationally defined as measured by the Graduate Record Examination (GRE), which is composed of verbal, quantitative, analytical, biology, mathematics, music, political science, and psychology components.
When a variable is composed of several factors, it is said to be multidimensional. This means that a multidimensional variable requires an instrument with several subtests in order to directly measure the underlying factors. A variable that does not have underlying factors is said to be unidimensional. A unidimensional measure captures only an isolated, unitary attribute. Examples of unidimensional measures are the Rosenberg Self-Esteem Scale and the Penn State Worry Questionnaire (PSWQ). Examples of multidimensional measures are various ability tests and personality tests that are composed of several factors. The 16 PF is a personality test that is composed of 16 components (reserved, more intelligent, affected by feelings, assertive, sober, conscientious, venturesome, tough-minded, suspicious, practical, shrewd, placid, experimenting, self-sufficient, controlled, and relaxed).
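The scoring difference between the two kinds of measures can be sketched as follows; the subtest names, items, and scores are invented and do not come from any of the instruments named above.

    # Illustrative sketch: a multidimensional variable is scored per factor (and as a composite),
    # while a unidimensional scale yields a single total. All values are invented.
    achievement_subtests = {"math": 32, "general_science": 28, "english": 35, "social_studies": 30}
    achievement_composite = sum(achievement_subtests.values())   # overall score across factors
    print(achievement_subtests, achievement_composite)           # factor profile plus composite

    self_esteem_items = [3, 4, 2, 4, 3, 4, 3, 3, 4, 2]           # unidimensional: one total only
    print(sum(self_esteem_items))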
The common tools used to measure variables in the educational setting are tests, questionnaires, inventories, rubrics, checklists, surveys, and others. Tests are usually used to determine student achievement and aptitude and serve a variety of purposes such as entrance exams, placement tests, and diagnostic tests. Rubrics are used to assess the performance of students in their presentations such as speeches, essays, songs, and dances. Questionnaires, inventories, and checklists are used to identify certain attributes of students such as their attitude toward studying, attitude toward math, feedback on the quality of food in the canteen, feedback on the quality of service during enrollment, and other aspects.

The Nature of Evaluation

Evaluation is arrived at when the necessary measurement and assessment have taken place. In order to evaluate whether a student will be retained or promoted to the next level, different aspects of the student's performance, such as grades and conduct, are carefully assessed and measured. To evaluate whether a remedial program in math is effective, the students' improvement in math, the teachers' teaching performance, and the students' change in attitude toward math should be carefully assessed. Different measures are used to assess different aspects of the remedial program to come up with an evaluation. According to Scriven (1967), evaluation is "judging the worth or merit" of a case (e.g., a student), program, policy, process, event, or activity. These objective judgments derived from evaluation enable stakeholders (a person or group with a direct interest, involvement, or investment in the program) to make further decisions about the case (e.g., students), programs, policies, processes, events, and activities.
In order to come up with a good evaluation, Fitzpatrick, Sanders, and Worthen (2004) indicated that there should be standards for judging quality and a decision on whether those standards should be relative or absolute. The standards are applied to determine the value, quality, utility, effectiveness, or significance of the case evaluated. In evaluating whether a university has a good reputation and offers quality education, it can be compared to a standard university that topped the world university rankings. The features of the university evaluated should be similar to those of the standard university selected. A standard can also be in the form of ideal objectives such as the ones set by the Philippine Accrediting Association of Schools, Colleges and Universities (PAASCU). A university is evaluated on whether it can meet the necessary standards set by the external evaluators.
Fitzpatrick, Sanders, and Worthen (2004) clarified the aims of evaluation in terms of its
purpose, outcome, implication, setting of agenda, generalizability, and standards. The purpose of
evaluation is to help those who hold a stake in whatever is being evaluated. Stakeholders consist
of many groups such as students, teachers, administrators, and staff. The outcome of evaluation leads to a judgment of whether a program is effective or not, whether to continue or stop a program, or whether to accept or reject a student in the school. In terms of implications, evaluation describes the program, policies, organization, product, and individuals involved. In setting the agenda for
evaluation, the questions for evaluation come from many sources, including the stakeholders. In
making generalizations, a good evaluation is specific to the context in which the evaluation
object rests. The standards of a good evaluation are assessed in terms of its accuracy, utility,
feasibility, and propriety.
A good evaluation adheres to the four standards of accuracy, utility, feasibility, and propriety set by the Joint Committee on Standards for Educational Evaluation, headed by Daniel Stufflebeam in 1975 at Western Michigan University's Evaluation Center. This set of four standards is now referred to as the 'Standards for Evaluation of Educational Programs, Projects, and Materials.' Table 1 presents the description of the four standards.

Table 1
Standards for Evaluation of Educational Programs, Projects, and Materials

Utility. Intended to ensure that an evaluation will serve the information needs of its intended users. Components: stakeholder identification, evaluator credibility, information scope and selection, values identification, report clarity, report timeliness and dissemination, evaluation impact.

Feasibility. Intended to ensure that an evaluation will be realistic, prudent, diplomatic, and frugal. Components: practical procedures, political viability, cost effectiveness.

Propriety. Intended to ensure that an evaluation will be conducted legally, ethically, and with due regard for the welfare of those involved in the evaluation as well as those affected by its results. Components: service orientation, formal agreements, rights of human subjects, human interaction, complete and fair assessment, disclosure of findings, conflict of interest, fiscal responsibility.

Accuracy. Intended to ensure that an evaluation will reveal and convey technically adequate information about the features that determine the worth or merit of the program being evaluated. Components: program documentation, context analysis, described purposes and procedures, defensible information sources, valid information, reliable information, systematic information, analysis of quantitative information, analysis of qualitative information, justified conclusions, impartial reporting, metaevaluation.

Forms of Evaluation

Owen (1999) classified evaluation according to its form. He said that evaluation can be
proactive, clarificative, interactive, monitoring, and impact.

1. Proactive. Proactive evaluation is conducted before a program begins and ensures that all critical areas are addressed before the program is designed. It assists stakeholders in making decisions about the type of program needed. It usually starts with a needs assessment to identify the needs of stakeholders that the program will address. A review of literature is conducted to determine best practices and to create benchmarks for the program.

2. Clarificative. This is conducted during program development. It focuses on the evaluation of all aspects of the program. It determines the intended outcomes and how the program as designed will achieve them. Determining how the program will achieve its goals involves identifying the strategies that will be implemented.

3. Interactive. This evaluation is conducted during program development. It focuses on improving the program. It identifies what the program is trying to achieve, whether the goals are consistent with the plan, and how the program can be changed to attain its goals more effectively.

4. Monitoring. This evaluation is conducted when the program has settled. It aims to justify and fine-tune the program. It focuses on whether the outcomes of the program have been delivered to its intended stakeholders. It determines whether the program is reaching the target population, whether the implementation meets the benchmarks, and what can be changed in the program to make it more efficient.

5. Impact. This evaluation is conducted when the program is already established. It focuses on the outcomes. It evaluates whether the program was implemented as planned, whether the needs were served, whether goal achievement is attributable to the program, and whether the program is cost effective.

These forms of evaluation are appropriate at certain time frames and stages of a program. The outline below shows when each form of evaluation is appropriate across the program duration.

Planning and development phase: Proactive, Clarificative
Implementation: Interactive, Monitoring
Settled: Impact

Models of Evaluation

Evaluation is also classified according to the models and frameworks used. The classifications of the models of evaluation are objectives-oriented, management-oriented, consumer-oriented, expertise-oriented, participant-oriented, and theory-driven.

1. Objectives-oriented. This model of evaluation determines the extent to which the goals of the program are met. The information that results from this model of evaluation can be
used to reformulate the purpose of the program evaluated, the activity itself, and the assessment
procedures used to determine the purpose or objectives of the program. In this model there
should be a set of established program objectives and measures are undertaken to evaluate which
goals were met and which goals were not met. The data is compared with the goals. The specific
models for the objectives-oriented are the Tylerian Evaluation Approach, Metfessel and
Michael’s Evaluation Paradigm, Provus Discrepancy Evaluation Model, Hammond’s Evaluation
Cube, and Logic Model (see Fitzpatrick, Sanders, & Worthen, 2004).

2. Management-oriented. This model is used to aid administrators, policy-makers, boards, and practitioners in making decisions about a program. The system is structured around
inputs, process, and outputs to aid in the process of conducting the evaluation. The major target
of this type of evaluation is the decision-maker. This form of evaluation provides the information
needed to decide on the status of a program. The specific models of this evaluation are the CIPP
(Context, Input, Process, and Product) by Stufflebeam, Alkin’s UCLA Evaluation Model, and
Patton’s Utilization-focused evaluation (see Fitzpatrick, Sanders, & Worthen, 2004).

3. Consumer-oriented. This model is useful in evaluating whether a product is feasible, marketable, and significant. A consumer-oriented evaluation can be undertaken to determine whether there will be many enrollees in a school to be built at a designated location, whether there will be takers of a proposed graduate program, and whether a course is producing students who are employable. Specific models for this evaluation are Scriven's Key Evaluation Checklist, Ken Komoski's EPIE Checklist, and the Morrisett and Stevens Curriculum Materials Analysis System (CMAS) (see Fitzpatrick, Sanders, & Worthen, 2004).

4. Expertise-oriented. This model of evaluation uses external experts to judge an institution's program, product, or activity. In the Philippine setting, the accreditation of schools is based on this model. A group of professional experts makes evaluations based on the existing school documents. The members of this group of experts should complement each other in producing a sound judgment of the school's standards. This model comes in the form of formal professional reviews (like accreditation), informal professional reviews, ad hoc panel reviews (like funding agency reviews and blue ribbon panels), ad hoc individual reviews, and educational connoisseurship (see Fitzpatrick, Sanders, & Worthen, 2004).

5. Participant-oriented. The primary concern of this model is to serve the needs of those who participate in the program, such as the students and teachers in the case of evaluating a course. This model depends on the values and perspectives of the recipients of an educational program. The specific models for this evaluation are Stake's Responsive Evaluation, Patton's Utilization-Focused Evaluation, and Fetterman's Empowerment Evaluation (see Fitzpatrick, Sanders, & Worthen, 2004).

6. Program Theory. This evaluation is conducted when stakeholders and evaluators intend to understand both the merits of a program and how its transformational processes can be exploited to improve the intervention (Chen, 2005). The effectiveness of a program in a theory-driven evaluation takes into account the causal mechanism and its implementation processes. Chen (2005) identified three strengths of program theory evaluation: (1) it serves accountability and program improvement needs, (2) it establishes construct validity for the parts of the evaluation process, and (3) it increases internal validity. Program theory measures the effect of the program intervention on the outcome as mediated by determinants. For example, in a program that implemented instruction and training for public school students on proper waste disposal, the quality of the training is assessed. The determinants for the stakeholders are then identified, such as adaptability, learning strategies, patience, and self-determination. These factors are measured as determinants. The outcome measures are then identified, such as the reduction of wastes, improvement of waste disposal practices, attitude change, and rating of environmental sanitation. The effect of the intervention on the determinants is assessed, as well as the effect of the determinants on the outcome measures. The direct effect of the intervention on the outcome is also assessed. The model of this evaluation is illustrated below.

Figure 1
Implicit Theory for Proper Waste Disposal

[Diagram: The intervention (quality of instruction and training) influences the determinants (adaptability, learning strategies, patience, and self-determination), which in turn influence the outcomes (reduction of wastes, improvement of waste disposal practices, attitude change, and rating of environmental sanitation).]
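To make the mediation idea concrete, the generic sketch below (not the authors' procedure) estimates the three paths in Figure 1 with ordinary least-squares regressions on simulated data; all variable names and coefficients are invented.

    # Generic sketch: intervention (X) -> determinant (M) -> outcome (Y), on simulated data.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    x = rng.normal(size=n)                                  # intervention measure (e.g., training quality)
    m = 0.6 * x + rng.normal(scale=0.8, size=n)             # determinant influenced by the intervention
    y = 0.5 * m + 0.2 * x + rng.normal(scale=0.8, size=n)   # outcome influenced by both

    def slopes(predictors, outcome):
        """Least-squares regression coefficients (excluding the intercept)."""
        design = np.column_stack([np.ones(len(outcome))] + predictors)
        return np.linalg.lstsq(design, outcome, rcond=None)[0][1:]

    a = slopes([x], m)[0]            # path from intervention to determinant
    b, c_direct = slopes([m, x], y)  # path from determinant to outcome, and the direct path
    print(f"a = {a:.2f}, b = {b:.2f}, direct c' = {c_direct:.2f}, indirect a*b = {a*b:.2f}")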

Table 2
Integration of the Forms and Models of Evaluation

Proactive. Focus: Is there a need? What do we/others know about the problems to be addressed? Best practices? Models: Consumer-oriented; identifying the context in CIPP.

Clarificative. Focus: What is the program trying to achieve? Is delivery working, consistent with the plan? How could the program or organization be changed to be more effective? Models: Setting goals in Tyler's Evaluation Approach.

Interactive. Focus: What is the program trying to achieve? Is delivery working, consistent with the plan? How could the program or organization be changed to be more effective? Models: Stake's Responsive Evaluation; objectives-oriented.

Monitoring. Focus: Is the program reaching the target population? Is implementation meeting benchmarks? Differences across sites and time? How/what can be changed to be more efficient and effective? Models: CIPP.

Impact. Focus: Is the program implemented as planned? Are stated goals achieved? Are needs served? Can goal achievement be attributed to the program? Unintended outcomes? Cost effective? Models: CIPP; objectives-oriented; program theory.

Table 3
Implementing Procedures of the Different Models of Evaluation
(Approach; specific model; implementing procedures)

Objectives-oriented
Tylerian Evaluation Approach: 1. Establish broad goals; 2. Classify the goals; 3. Define objectives in behavioral terms; 4. Find situations in which achievement of objectives can be shown; 5. Develop measurement techniques; 6. Collect performance data; 7. Compare performance data with the behaviorally stated objectives.
Metfessel and Michael's Evaluation Paradigm: 1. Involve stakeholders as facilitators in program evaluation; 2. Formulate goals; 3. Translate objectives into communicable forms; 4. Select instruments to furnish measures; 5. Carry out periodic observation; 6. Analyze data; 7. Interpret data using standards; 8. Develop recommendations for further implementation.
Provus Discrepancy Evaluation Model: 1. Agree on standards; 2. Determine whether a discrepancy exists between performance and standards; 3. Use information on discrepancies to decide whether to improve, maintain, or terminate the program.
Hammond's Evaluation Cube: 1. Needs of stakeholders; 2. Characteristics of the clients; 3. Source of service.
Logic Model: 1. Inputs; 2. Service; 3. Outputs; 4. Immediate, intermediate, long-term, and ultimate outcomes.

Management-oriented
CIPP (Context, Input, Process, and Product) by Stufflebeam: 1. Context evaluation; 2. Input evaluation; 3. Process evaluation; 4. Product evaluation.
Alkin's UCLA Evaluation Model: 1. Systems assessment; 2. Program planning; 3. Program implementation; 4. Program improvement; 5. Program certification.
Patton's Utilization-Focused Evaluation: 1. Identify relevant decision makers and information users; 2. Determine what information is needed by various people; 3. Collect and provide the information.

Consumer-oriented
Scriven's Key Evaluation Checklist: 1. Evidence of achievement; 2. Follow-up results; 3. Secondary and unintended effects; 4. Range of utility; 5. Moral considerations; 6. Costs.
Morrisett and Stevens Curriculum Materials Analysis System (CMAS): 1. Describe characteristics of the product; 2. Analyze rationale and objectivity; 3. Consider antecedent conditions; 4. Consider content; 5. Consider instructional theory; 6. Form an overall judgment.

Expertise-oriented
Formal professional reviews: accreditation.
Informal professional reviews: peer reviews.
Ad hoc panel reviews: funding agency reviews, blue ribbon panels.
Ad hoc individual reviews: consultation.
Educational connoisseurship: critics.

Participant-oriented
Stake's Responsive Evaluation: 1. Intents; 2. Observations; 3. Standards; 4. Judgments.
Fetterman's Empowerment Evaluation: 1. Training; 2. Facilitation; 3. Advocacy; 4. Illumination; 5. Liberation.

Program theory
Determinants mediating the relationship between intervention and outcome, and relationships between program components conditioned by a third factor: 1. Establish a common understanding between stakeholders and evaluator; 2. Clarify the stakeholders' theory; 3. Construct the research design.

EMPIRICAL REPORTS
Examples of Evaluation Studies

Program Evaluation of the Civic Welfare Training Services
By Carlo Magno

The NSTPCW1 and NSTPCW2 of a college were evaluated using Stake's Responsive Evaluation. The NSTP offered by the college is the Civic Welfare Training Service (CWTS), which focuses on developing students' social concern, values, volunteerism, and service for the general welfare of the community. The main purpose of the evaluation is to determine the impact of the current NSTPCW1 and NSTPCW2 program offered by DLS-CSB by assessing (1) students' values, management strategies, and awareness of social issues, (2) students' performance during the immersion, (3) students' insights after immersion, (4) teaching performance, and (5) strengths and weaknesses of the program. The evaluation of the outcome of the program shows that the impact on values is high, the impact of the components of the NSTPCW2 is high, and the awareness of social issues is also high. The students' insights show that the acquisition of skills, values, and awareness concords with the impact gained. There is agreement that the students are consistently present and that they show high ratings on service, involvement, and attitude during the immersion activity. The more the teacher uses a learner-centered approach, the better the outcome on the students' part. The strengths of NSTPCW1 include internal and external aspects, and the weaknesses are on the teachers, class activities, and the social aspect. For NSTPCW2, the strengths are on student learning, activities, and formation, while the weaknesses are on the structure, activities, additional strategies, and the outreach area. When compared with the Principle on Social Development of the Lasallian Guiding Principles, the NSTP program is generally acceptable in terms of the standards on understanding of social reality and social intervention and on developing solidarity and collaboration with the immersion centers.

An Evaluation of the Community Service Program of the De La Salle University-College of Saint Benilde
By Josefina Otarra-Sembrabo

The Community Service Program is an outreach program in line with the mission-vision of De La Salle-College of Saint Benilde (DLS-CSB). The Benildean core values are realized through direct service to marginalized sectors in society. The students are tasked to have immersion with the marginalized such as street children, the elderly, special people, and the like. After their service in the community, students reflect on what they did, formulate insights, and relate these to their Lasallian education. This service is a social transformation for the students and the community.

To evaluate the Community Service Program (CSP), Stufflebeam's Context-Input-Process-Product Evaluation was utilized. This type of evaluation focuses on the decision-management strategy. In the model, continuous feedback is needed for better decisions and for improvement of the program. The framework has four types of evaluation: context, input, process, and product. The context evaluation determines if the objectives of the program have been met. It aims to know if the objectives of the CSP have been achieved in relation to the mission and vision of DLS-CSB. The input evaluation describes the respondents and beneficiaries of the CSP. Process evaluation describes how the program was implemented in terms of procedures, policies, techniques, and strategies. This provides the evaluators the needed information to determine the procedural issues and to interpret the outcome of the project. In the product evaluation, the outcome information
is related to the objectives and to the context, input, and process information. The information will be used to decide on whether to terminate, modify, or refocus a program.

There were a total of 250 participants in the study composed of students, beneficiaries, program staff members, and selected clients. The instruments used were three sets of evaluation questionnaires for the students, program implementers, and beneficiaries, and one interview guide used for the recipients of the CSP. Data analysis was both quantitative and qualitative in nature.

For the context evaluation, the evaluators looked into the objectives of the CSP, the mission-vision of CSB, the objectives of the Social Action Office (SAO), and their congruence. The DLS-CSB mission-vision is realized in the six core Benildean values, and to realize the mission-vision, the SAO created the CSP to enhance the social awareness of the students and instill social responsibility. Likewise, the objectives of the CSP are aligned with the CSB mission and vision. Seventy-five percent of the respondents said that the CSP objectives are in line with the CSB mission-vision, and this was supported with actual experiences. The students and beneficiaries gave only a moderate rating on the extent to which the community service program's objectives have been met.

For the input evaluation, the profiles of the students, program recipients, and implementers were reported. Most of the students were male, the average age was 21, and most were from Manila. The recipients are mostly centers in the metropolis run by religious groups. The program implementers, on the other hand, are staff members responsible for the implementation of the program who have been with the college for 1-5 years.

The process evaluation of the program focused on the policies and procedures of the CSP, the role of the community service adviser, the strengths and weaknesses of the CSP, recommendations for improvement, and the insights of the program beneficiaries. In terms of policies, the CSP is a requirement for CSB students written in the Handbook. The program has 10 procedures including application, general assembly, group meetings, leadership training, orientation seminar, initial area visit, immersion, group processing, and submission of documents. The students rated these as moderate as well, and seven out of 10 of the procedures need improvement. On the role of the advisers, 68 of the students considered the role of advisers as helpful; however, the effectiveness of their performance was rated only moderately satisfactory. Three strong points given to the CSP are the provision of opportunities to gain social awareness, the actualization of social responsibility, and the personal growth of the students. The weaknesses include the difficulty of program procedures, processes, and locations and the negative attitude of some students. Some of the recommendations focus on program preparation, program staff, and community service locations. For the insights of the beneficiaries, some problems such as the attendance and seriousness of the students are taken into account and resolved through dialogue, feedback, and meetings. They also suggested to the CSP more intensive orientation and preparation as well as closer coordination and program continuity.

Lastly, for the product evaluation, the internalization and personification of the core Benildean values and the benefits gained by the students and beneficiaries were taken into account. For the internalization and personification, it appears that four out of the six core values are manifested by the students: deeply rooted faith, appreciation of individual uniqueness, professional competency, and creativity. Students also gained personal benefits such as increased social awareness, actualization of social responsibility, positive values, and realization of their blessings. On the other hand, the beneficiaries' benefits include long-term and short-term benefits. The short-term benefits are the socialization activities, the interaction between the students and clients, material help, manpower assistance, and tutorial classes, while the long-term benefits are the values inculcated in the children, interpersonal relationships, the knowledge imparted to them, and the contribution to physical growth. The program beneficiaries also identified strengths of the
CSP, such as the development of inner feelings of happiness, love, and concern as a result of their interaction with the students, the knowledge imparted to them, and the extension of material help through the program. The weaknesses, on the other hand, include the lack of preparation for and interaction with the beneficiaries.

These findings are the basis of the conclusions. DLS-CSB has indeed a clear vision for its students, and it was actualized in the CSP. There is a need to strengthen the relation of the CSP objectives to the college vision-mission, as implied in the moderate ratings in the evaluation. There also seems to be a need for expansion of the coverage of program recipients since it does not fully address the objectives set in the CSP. A review and update of procedures is needed due to the problems encountered by the students and beneficiaries. The CSP advisers were also not able to perform their roles well from the point of view of the students and the representatives of the centers. The weaknesses pointed out in this program imply that there is a need for improvement especially in the procedural stage. More intensive preparation should be done both in the implementation and in interacting with the marginalized sectors due to the need to better understand the sector they are to serve. Continuity of the program was highly recommended because of the short-term and repetitive activities, which will allow the program to successfully inculcate all of the core Benildean values. However, the integration of these core values does not vary among the students in terms of sex, year of entry, and course. All in all, the community service program proved to be beneficial for the students, beneficiaries, and recipients of the program.

In regard to the findings and conclusions, there are some recommendations for the CSP. Recommendations include continuity, changes, and improvement by taking into consideration the flaws and weaknesses of the previous program: intensive preparation for the service, review of the load of the students so they could give quality service to the sectors, improvement in the procedural stages, implementation of the CSP on a regular basis, student training, production of documentation and organized reports by the students, systematized community service, more volunteers, expanded coverage of marginalized sectors, consideration of other locations of marginalized sectors, informing the students of their specific roles in the community service, involvement of the community service unit in seminars and conferences, periodic program evaluation, assessment of students' involvement in the sectors, systematized needs assessment, and longitudinal studies on the effects of the CSP on the lives of previous CSP volunteers.

Shorter Summary

This article deals with how the DLS-CSB Community Service Program (CSP) was evaluated through the use of Stufflebeam's Context-Input-Process-Product Evaluation Model. This type of evaluation focuses on the decision-management strategy. Here, continuous feedback is needed for better decisions and improvement of the program. The framework has four types of evaluation: context, input, process, and product. The context evaluation determines if the objectives of the program have been met; it aims to know if the objectives of the CSP have been achieved in relation to the mission and vision of the college. The input evaluation describes the respondents and beneficiaries of the CSP. Process evaluation describes how the program was implemented in
terms of procedures, policies, techniques, and strategies. This provides the evaluators the needed information to determine the procedural issues and to interpret the outcome of the project. In the product evaluation, the outcome information is related to the objectives and to the context, input, and process information. The information will be used to decide on whether to terminate, modify, or refocus a program. To do this, a total of 250 participants composed of students, beneficiaries, program staff members, and selected clients were included in the evaluation. The instruments used were three sets of evaluation questionnaires for the students, program implementers, and beneficiaries, and one interview guide used for the recipients of the CSP. Data analysis was both quantitative and qualitative in nature. For the context evaluation, the evaluators looked into the objectives of the CSP, the mission-vision of CSB, the objectives of the SAO, and their congruence with each other. For the input evaluation, the profiles of the students, program recipients, and implementers were reported. The process evaluation of the program focused on the policies and procedures of the CSP, the role of the community service adviser, the strengths and weaknesses of the CSP, recommendations for improvement, and the insights of the program beneficiaries. Lastly, for the product evaluation, the internalization and personification of the core Benildean values and the benefits gained by the students and beneficiaries were taken into account. The findings were used as the basis of the conclusions. DLS-CSB has indeed a clear vision for its students, and it was actualized in the CSP. There is a need to strengthen the relation of the CSP objectives to the college vision-mission, as implied in the moderate ratings in the evaluation. There also seems to be a need for expansion of the coverage of program recipients since it does not fully address the objectives set in the CSP. A review and update of procedures is needed due to the problems encountered by the students and beneficiaries. The CSP advisers were also not able to perform their roles well from the point of view of the students and the representatives of the centers. The weaknesses pointed out in this program imply that there is a need for improvement especially in the procedural stage. More intensive preparation should be done both in the implementation and in interacting with the marginalized sectors due to the need to better understand the sector they are to serve. Continuity of the program was highly recommended because of the short-term and repetitive activities, which will allow the program to successfully inculcate all of the core Benildean values. However, the integration of these core values does not vary among the students in terms of sex, year of entry, and course. All in all, the community service program proved to be beneficial for the students, beneficiaries, and recipients of the program. In regard to the findings and conclusions, there are some recommendations for the CSP. Recommendations include continuity, changes, and improvement by taking into consideration the flaws and weaknesses of the previous program: intensive preparation for the service, review of the load of the students so they could give quality service to the sectors, improvement in the procedural stages, implementation of the CSP on a regular basis, student training, production of documentation and organized reports by the students, systematized community service, more volunteers, expanded coverage of marginalized sectors, consideration of other locations of marginalized sectors, informing the students of their specific roles in the community service, involvement of the community service unit in seminars and conferences, periodic program evaluation, assessment of students' involvement in the sectors, systematized needs assessment, and longitudinal studies on the effects of the CSP on the lives of previous CSP volunteers.

World Bank Evaluation Studies on Educational Policy
By Carlo Magno

This report provides a panoramic view of different studies on education sponsored by the World Bank, focusing on the evaluation component. The report specifically presents completed studies on educational policy from 1990 to 2006. A panoramic view of the studies is presented showing the area of investigation, evaluation model, method used, and recommendations. A synthesis of these reports is shown in terms of the areas of investigation, content, methodology, and model used through vote counting. Vote counting is a modal categorization assumed to give the best estimate of selected criteria (Bushman, 1997).

The World Bank provides support to education systems throughout the developing world. Such support is broadly aimed at helping countries attain the objectives of "Education for
All" and education for success in the knowledge economy. An important goal is to tailor Bank assistance to region- and country-specific factors such as demographics, culture, and the socio-economic or geopolitical climate. Consequently, a top priority is to inform development assistance with the benefit of country-specific analysis examining (1) what factors drive education outcomes; (2) how they interact with each other; (3) which factors carry the most weight and which actions are likely to produce the greatest result; and (4) where the greatest risks and constraints lie. The World Bank divided the countries according to different regions such as Sub-Saharan Africa, East Asia and the Pacific, Europe and Central Asia, Latin America and the Caribbean, and the Middle East and North Africa.

Areas of Investigation

There are 28 studies on educational policy with a manifest evaluation component. Education studies with no evaluation aspect were not included. A synopsis of each study with the corresponding methodology and recommendations is found in Appendix A. The different areas of investigation were enumerated, and the number of studies conducted for each was counted according to the sequence of years, as shown in Table 1. Most of the studies on educational policy target the basic needs of a country or a specified region of the world, such as the effectiveness of basic and tertiary education, critical periods such as child development programs, and the promotion of adult literacy. For the earliest period (the 1990s), the trend of the studies is on information and communications technology (ICT) in basic education. The pattern for the 21st-century studies shows a concentration on evaluating the implementation of tertiary education across countries. This is critical since developing nations rely on the expertise produced by their manpower in the field of science and technology. For the latest period, a new area of investigation, language learning, was explored due to the recognition of globalization in some countries like Vanuatu.

Table 1
Counts of Area of Investigation From 1990 - 2006

2006: Vanuatu, Language learning, 1 study (total for the year: 1)
2005: None (0)
2004: Indonesia and Thailand, Undergraduate/Tertiary Education, 2; Senegal, Adult Literacy, 1; Different Regions and Colombia, Early Child Development, 2 (total: 5)
2003: Thailand, Undergraduate/Tertiary Education, 1; Different Regions, AIDS/HIV Prevention, 1 (total: 2)
2002: Different Regions, Textbook/Reading Materials, 1; Africa, Secondary Education, 1 (total: 2)
2001: Brazil, Early Child Development, 1; China, Secondary Education, 1 (total: 2)
2000: Different Regions, School Self-Evaluation, 1; Different Regions, Early Child Development, 1; Pakistan and Cuba, Basic Education, 3; Africa, Adult Literacy, 1; Africa, Tertiary Distance Education, 1 (total: 7)
1999: USA, Test Evaluation, 1; Different Regions, Infant Care, 1; Early Child Development, 1 (total: 3)
1998: Different Regions, Teacher Development, 1; Different Regions, ICT, 1 (total: 2)
1997: None (0)
1996: Different Regions, Basic Education (school financing), 1; Chile, ICT, 1 (total: 2)
1995: None (0)
1994: Philippines, Vocational Education, 1 (total: 1)
1993: None (0)
1992: Different Regions, Secondary Education, 1 (total: 1)
Total = 28

It is shown in Table 1 that most studies on educational policy were conducted in the year 2000, since it is a turning point of the century. For the coming of a new century, much is being prepared, and this is operationalized by assessing a worldwide report on what has been accomplished in the recent 20th century. The studies typically cover a broad range of education topics such as school self-evaluation, early child development, basic education, adult literacy, and tertiary distance education. These areas of investigation cover most of the fields
done for the 20th century, and an overall view of what has been accomplished was reported. It can also be noted that there is an increase in studies conducted at the start of the 21st century. This can be explained by the growing trend of globalization, where communication across countries is more accessible. It can also be noted that no studies on educational policy with evaluation were completed for the years 1993, 1995, 1997, and 2005. The trend in the number of studies shows that, consequently, after a year the studies give more generalized findings since they covered a larger and wider array of samples and took a long period of time to finish. More results are expected before the end of 2005. The trend of studies across the years is significantly different from the expected number of studies, as revealed using a one-way chi-square where the computed value (chi-square = 28.73, df = 14) exceeds the critical value of 23.58 at the 5% probability of error.

Table 2
Counts of Area of Investigation From 1990 - 2006

Language learning: 1
Undergraduate/Tertiary Education: 4
Adult Literacy: 2
Early Child Development: 5
AIDS/HIV Prevention: 1
Textbook/Reading Material: 1
Secondary Education: 3
School Self-Evaluation: 1
Basic Education: 4
Test Evaluation: 1
Infant Care: 1
ICT: 2
Teacher Development: 1
Vocational Education: 1

Table 2 shows the number of studies conducted for every area in line with educational policy with evaluation. Most of the studies completed and funded are in the area of early child development, followed by tertiary education and basic education. This can be explained by the increasing number of early child care programs around the world, which are continuing and need to be evaluated in terms of their effectiveness at a certain period of time. Much of the concern is on early child development since it is a critical stage in life, and development is evidently hampered if the child is not cared for at an early age. This also shows the increasing number of children whose needs are undermined and for whom intervention has to take place. These programs sought the assistance of the World Bank because they need further funding to continue to exist. Having an evaluation of the child program likely supports the approval of further grants.

There are also a large number of studies on basic and tertiary education where effectiveness is evaluated. Almost all countries offer the same structure of education worldwide in terms of the levels from basic education to tertiary education. These levels need attention since they are a basic key for developing nations to improve the quality of their education, because the quality of their people's skills depends on the country's overall labor force.

When the observed counts of studies for each area of interest are tested for goodness of fit, the computed chi-square value (chi-square = 13, df = 13) did not reach significance at the 5% level of significance. This means that the observed counts per area do not significantly differ from what is expected.

Table 3
Study Grants by Country

Vanuatu: 1
Indonesia: 1
Thailand: 1
Senegal: 1
Different Regions: 10
Brazil: 1
China: 1
Pakistan: 1
Cuba: 1
Africa: 2
USA: 1
Chile: 1
Philippines: 1
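As a rough illustration of the one-way chi-square tests cited in this report, the sketch below applies a goodness-of-fit test (assuming uniform expected frequencies) to the observed counts per area from Table 2; this reproduces the chi-square = 13 (df = 13) result mentioned above.

    # Illustrative sketch: one-way (goodness-of-fit) chi-square on the Table 2 counts.
    from scipy import stats

    observed = [1, 4, 2, 5, 1, 1, 3, 1, 4, 1, 1, 2, 1, 1]   # studies per area of investigation
    chi2, p = stats.chisquare(observed)                     # expected frequencies assumed uniform
    print(f"chi-square = {chi2:.2f}, df = {len(observed) - 1}, p = {p:.3f}")
    # chi-square = 13.00, df = 13, p is about 0.45 -> not significant at the .05 level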

The studies done for each country are almost equally distributed, except for Africa with two studies from 1990 until the present period. There is a bulk of studies done worldwide which covers a wider array of sampling across different countries. The worldwide studies usually evaluate common programs across different countries such as teacher effectiveness and child development programs, although there is great difficulty in coming up with an efficient judgment of the overall standards of each program separately. The advantage of having a worldwide study on educational programs for different regions is to have a simultaneous description of the common programs that are running, where the funding is most likely concentrated in one team of investigators rather than in separate studies with different fund allocations. Another is the efficiency of maintaining consistency of procedures across different settings, unlike different researchers setting different standards for each country.

In the case of Africa, two studies were granted, concentrating on adult literacy and distance education, because these educational programs are critical in the region as compared to others. The demographics of the African region show that these programs (adult literacy, distance education) are increasingly benefiting their stakeholders. There is a report of remarkable improvement in adult education, and more tertiary students are benefiting from distance education. Since the programs are showing effectiveness, much funding is needed to continue them.

When the numbers of studies are tested for significance across countries, the computed chi-square (chi-square = 35.44, df = 12) reached significance against a critical value of 21.03 at the 5% probability of error. This means that the number of studies for each country differs significantly from what is expected. This is also due to having a large concentration of studies for different regions as compared to the minimal studies for each country, which made the difference.

Method of Studies

Various methodologies are used to investigate the effectiveness of educational programs across different countries, although it can be seen in the report that there is not much concentration and elaboration on the use and implementation of the procedures done to evaluate the programs. Most only mentioned the questionnaires and assessment techniques they used. There are some that mentioned a broad range of methodologies such as quasi-experiments and case studies, but the specific designs are not indicated. It can also be noted that reports written by researchers/professors from universities are very clear in their method, which is academic in nature, but World Bank personnel writing the reports tend to focus on the justification of the funding rather than on the clarity of the research procedure undertaken. It can also be noted that the reports did not show any section on methodology. Most presented the introduction and some justifications of the program and, at the end, the recommendations. The methodologies are just mentioned and not elaborated within the report and are only mentioned in some parts of the justification of the program.

Table 4
Counts of Methods Used

Questionnaires/Inventories/Tests: 4
Quasi-Experimental: 5
True Experimental: 1
Archival Data (analysis of available demographics): 6
Observations: 1
Case Studies: 1
Surveys: 1
Multimethod: 9

It can be noted in Table 4 that most studies employ a multimethod approach where
different methods are employed in a study. The multimethod approach creates an efficient way
of cross-validating results for every methodology undertaken. One result from one method can
be checked against a result from another method, which makes the approach more powerful than
relying on a single method. Since evaluation of the program is being done in most studies, it is
indeed better to consider using a multimethod approach since it can generate findings from which
the researcher can arrive at a better judgment and description of the program.

It can also be noted that most studies are also using archival data to make justifications
of the program. Most of these researchers, in reference to the archival data, are coming up with
inferences from enrollment percentages, dropout rates, achievement levels, and statistics on
physical conditions such as weight and height, which can be valid but do not directly assess the
effectiveness of the program. The difficulty of using these statistics is that they do not provide a
post measurement of the program evaluated. This may be due to the difficulty of arriving at
national surveys on achievement levels and enrollment profiles of different educational
institutions, which are done annually but may not be in concordance with the timetable of the
researchers. It is also commendable that a number of studies are considering quasi-experimental
designs to directly assess the effectiveness of educational programs.

When the counts of the methodologies used are tested for significance, the computed
chi-square value (χ2 = 18.29, df = 7) reaches significance over the critical chi-square value of
χ2 = 14.07 at the 5% probability of error. This shows that the methodologies used vary
significantly from what is expected.
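The critical values cited in these comparisons can be obtained from the chi-square distribution
itself. The short Python sketch below (an illustration, not part of the original report) looks up
the 5% critical values for the degrees of freedom used above:

    from scipy.stats import chi2

    # 5% (upper-tail) critical values for the tests reported above:
    # df = 13 for the areas of investigation, df = 12 for the countries,
    # and df = 7 for the eight method categories.
    for df in (13, 12, 7):
        print(df, round(chi2.ppf(0.95, df), 2))
    # df = 12 gives 21.03 and df = 7 gives 14.07, the critical values cited in the
    # text; a computed chi-square larger than these is significant at the 5% level.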
The Use of Evaluation Models

The evaluation models used by the studies were counted. There was difficulty in
identifying the models used since the researchers did not specifically elaborate the evaluation
model or framework that they were using. It can also be noted that the researchers are not really
after the model but after establishing the program or its continuity. There is a marked difference
between university academicians and World Bank personnel doing the studies: the latter are
misplaced in their assessment due to the lack of guidance from a model, while academicians
would specifically state the context but somehow fail to elaborate the process for adopting a
CIPP model. Most studies are clear in their program objectives but fail to provide accurate and
direct measures of the program. Worst of all, most studies are actually not guided by a model in
evaluating the educational programs proposed.

Table 5
Counts of Models/Frameworks Used

Model/Framework                   Counts
Objectives-Oriented Evaluation    10
Management-Oriented Evaluation    9
Consumer-Oriented Evaluation      0
Expertise-Oriented Evaluation     7
Participant-Oriented Evaluation   1
No model specified                3

As shown in Table 5, the majority of the evaluations used the objectives-oriented model,
where the program objectives are specified and evaluated accordingly. A large number also used
the management-oriented model and specifically made use of the CIPP by Stufflebeam (1968). A
number of studies also used experts as external evaluators of the program implementation. Most
of the studies actually did not mention the model used, and the models were just identified from
the procedures described in conducting the evaluation.

Most studies used the objectives-oriented model since the thrust is on educational policy
22

and most educational programs start with a statement of objectives. These objectives are also
treated as ends on which the evaluation is basically based. The other studies, which used the
management-oriented evaluation, are the ones that typically describe the context of the
educational setting in terms of the available archival data provided by national and countrywide
surveys. The inputs and outputs are also described, but most are weak in elaborating the process
undertaken. The counts on the use of evaluation models (χ2 = 18, df = 5) reached significance at
the 5% level of error. This means that the counts are significantly different from what is
expected. This shows a need to use other models of evaluation as appropriate to the study being
conducted.
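To make the last result concrete, the following small Python sketch (an illustration, not part of
the original report; it assumes that every category is expected to hold an equal share of the 30
studies) reproduces the reported chi-square of 18 for Table 5 and shows which categories
contribute most to it:

    # Observed counts from Table 5 (30 studies across six model categories).
    observed = {
        "Objectives-Oriented": 10,
        "Management-Oriented": 9,
        "Consumer-Oriented": 0,
        "Expertise-Oriented": 7,
        "Participant-Oriented": 1,
        "No model specified": 3,
    }
    expected = sum(observed.values()) / len(observed)  # 30 / 6 = 5 per category

    # Each category's contribution to the chi-square statistic: (O - E)^2 / E.
    contributions = {k: (v - expected) ** 2 / expected for k, v in observed.items()}
    print(round(sum(contributions.values()), 2))  # 18.0 with df = 5, as reported
    # The objectives-oriented and consumer-oriented categories contribute 5.0 each,
    # and the management- and participant-oriented categories 3.2 each, which is
    # where the departure from an even spread of models comes from.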

Recommendations

1. It is recommended to increase the distribution of study grants across countries. There is a
concentration of studies performed regionally, which may neglect cultural and ethical
considerations on testing and other forms of assessment. As a consequence, there is no
cross-cultural perspective on how the programs are implemented in each country because the
focus is on the consistency of the programs. Conducting individual studies will show a more
in-depth perspective of the program and how it is situated within a specific context.

2. It is recommended to have a specific section on the methodology undertaken by the
researcher. This helps future researchers judge the validity of the procedures undertaken by the
study. Specifying the method clearly enables the study to be replicated as a best practice by
future researchers and makes it easy to identify procedures that need to be improved.

3. It is recommended to have separate studies concentrating exclusively on program evaluation
after successive program implementations. This will provide a better picture of the worth of a
program since the judgment on how the program is taking place is concentrated on, and not on
other matters which undermine the result of the program. A good alternative is for the research
grantee to allocate another budget for a follow-up program evaluation after establishing the
program.

4. It is recommended that when screening studies, a criterion on the use of an evaluation model
should be included. Researchers making an evaluation study can be guided better with the use of
an evaluation model.

References

Bray, M. (1996). Decentralization of Education: Community Financing. World Bank Reports.

Brazil Early Child Development: A Focus on the Impact of Preschools. (2001). World Bank
Reports.

Bregman, J., & Stallmeister, S. (2002). Secondary Education in Africa: Strategies for Renewal.
World Bank Reports.

Bushman, B. J. (1997). Vote-counting procedures in meta-analysis. In H. Cooper & L. V. Hedges
(Eds.), The Handbook of Research Synthesis. New York: Russell Sage Foundation.

Craig, H. J., Kraft, R. J., & du Plessis, J. (1998). Teacher Development: Making an Impact.
World Bank Reports.

Education and HIV/AIDS: A Sourcebook of HIV/AIDS Prevention Programs. (2003). World
Bank Reports.

Fretwell, D. I., & Colombano, J. E. (2000). Adult Continuing Education: An Integral Part of
Lifelong Learning. Emerging Policies and Programs for the 21st Century in Upper and Middle
Income Countries. World Bank Reports.
23

Gasperini, L. (2000). The Cuban Education System: Lessons and Dilemmas. World Bank
Reports.

Getting an Early Start on Early Child Development. (2004). World Bank Reports.

Grigorenko, E. L., & Sternberg, R. J. (1999). Assessing Cognitive Development in Early
Childhood. World Bank Reports.

Indonesia - Quality of Undergraduate Education Project. (2004). World Bank Reports.

Liang, X. (2001). China: Challenges of Secondary Education. World Bank Reports.

Nordtveit, B. J. (2004). Managing Public–Private Partnership: Lessons from Literacy Education
in Senegal. World Bank Reports.

O'Gara, C., Lusk, D., Canahuati, J., Yablick, G., & Huffman, S. L. (1999). Good Practices in
Infant and Toddler Group Care. World Bank Reports.

Operational Guidelines for Textbooks and Reading Materials. (2002). World Bank Reports.

Orazem, P. F. (2000). The Urban and Rural Fellowship School Experiments in Pakistan: Design,
Evaluation, and Sustainability. World Bank Reports.

Osin, L. (1998). Computers in Education in Developing Countries: Why and How? World Bank
Reports.

Philippines - Vocational Training Project. (1994). World Bank Reports.

Potashnik, M. (1996). Chile's Learning Network. World Bank Reports.

Riley, K., & MacBeath, J. (2000). Putting School Self-Evaluation in Place. World Bank Reports.

Saint, W. (2000). Tertiary Distance Education and Technology in Sub-Saharan Africa. World
Bank Reports.

Saunders, L. (2000). Effective Schooling in Rural Africa Report 2: Key Issues Concerning
School Effectiveness and Improvement. World Bank Reports.

Stufflebeam, D. L. (1968). Evaluation as enlightenment for decision making. Columbus: Ohio
State University Evaluation Center.

Tertiary Education in Colombia: Paving the Way for Reform. (2003). World Bank Reports.

Thailand - Universities Science and Engineering Education Project. (2004). World Bank Reports.

Vanuatu: Learning and Innovation Credit for a Second Education Project. (2006). World Bank
Reports.

Ware, S. A. (1992). Secondary School Science in Developing Countries: Status and Issues.
World Bank Reports.

Xie, Q., & Young, M. E. (1999). Integrated Child Development in Rural China. World Bank
Reports.

Young, E. M. (2000). From Early Child Development to Human Development: Investing in Our
Children's Future. World Bank Reports.
24

Activity # 2

1. Look for an evaluation study that is published on the Asian Development Bank webpage.
2. Summarize the study report in the following:
- What features of the study made it an evaluation?
- What form and model of evaluation was used?
- How was the form or model implemented in the study?
- What aspects of the evaluation study were measured?
25

Lesson 3
The Process of Assessment

The previous lesson clarified the distinction between measurement and evaluation. After
learning the process of assessment in this lesson, you should understand how measurement and
evaluation are used in assessment.
Assessment goes beyond measurement. Evaluation can be involved in the process of
assessment. Some definitions from assessment references show the overlap between assessment
and evaluation. But Popham (1998), Gronlund (1993), and Huba and Freed (2000) defined
assessment without overlap with evaluation. Take note of the following definitions:

1. Classroom assessment can be defined as the collection, evaluation, and use of


information to help teachers make better decisions (McMillan, 2001).
2. Assessment is a process used by teachers and students during instruction that provides
feedback to adjust ongoing teaching and learning to improve students’ achievement of intended
instructional outcomes (Popham, 1998).
3. Assessment is the systematic process of determining educational objectives, gathering,
using, and analyzing information about student learning outcomes to make decisions about
programs, individual student progress, or accountability (Gronlund, 1993).
4. Assessment is the process of gathering and discussing information from multiple and
diverse sources in order to develop a deep understanding of what students know, understand, and
can do with their knowledge as a result of their educational experiences; the process culminates
when assessment results are used to improve subsequent learning (Huba & Freed, 2000).

Cronbach (1960) identified three important features of assessment that make it distinct from
evaluation: (1) use of a variety of techniques, (2) reliance on observation in structured and
unstructured situations, and (3) integration of information. These three features emphasize that
assessment is not based on a single measure but on a variety of measures. In the classroom, a
student’s grade is composed of quizzes, assignments, recitations, long tests, projects, and final
exams. These sources are assessed through formal and informal structures and integrated to come
up with an overall assessment as represented by a student’s final grade (a simple computational
sketch of this kind of integration appears after the list below). In Lesson 1, assessment was
defined as “the process of collecting various information needed to come up with an overall
information that reflects the attainment of goals and purposes.” There are three critical
characteristics of this definition:

1. Process of collecting various information. A teacher arrives at an assessment after


having conducted several measures of student’s performance. Such sources are recitations, long
tests, final exams, and projects. Likewise, a student is proclaimed gifted after having been tested
with a battery (several) of intelligence and ability tests. A student to be designated as having
Attention Deficit Disorder (ADD) needs to be diagnosed through several attention span and
cognitive tests together with a series of clinical interviews by a skilled clinical psychologist. A
variety of information is needed in order to come up with a valid way of arriving at accurate
information.

2. Integration of overall information. Coming up with an integrated assessment from


various sources needs to consider many aspects. The results of individual measures should be
consistent with each other to contribute meaningfully to the overall assessment. In such cases, a
26

battery of intelligence tests should yield the same results in order to determine the overall ability
of a case. In cases where some results are inconsistent, there should be a synthesis of the overall
assessment indicating that in some measures the results do not support the overall assessment.

3. Attainment of goals and purposes. Assessment is conducted based on specified goals.


Assessment processes are framed for a specified objective to determine if they are met.
Assessment results are the best way to determine the extent to which a student has attained the
objectives intended.
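As referenced above, here is a minimal computational sketch of how several assessment sources
can be integrated into a single overall value such as a final grade. The component scores and
weights below are hypothetical; as noted in Lesson 1, actual weights depend on the goals of the
school, the subject, and the teacher.

    # Hypothetical component scores (out of 100) from multiple assessment sources.
    components = {
        "quizzes": 85,
        "long tests": 78,
        "recitation": 90,
        "project": 88,
        "final exam": 80,
    }

    # Assumed weighting scheme (must sum to 1.0); schools and teachers set their own.
    weights = {
        "quizzes": 0.20,
        "long tests": 0.25,
        "recitation": 0.15,
        "project": 0.15,
        "final exam": 0.25,
    }

    final_grade = sum(components[k] * weights[k] for k in components)
    print(round(final_grade, 1))  # 83.2, one value integrating all the sources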

The Process of Assessment

The process of assessment was summarized by Bloom (1970). He indicated that there are
two processes involved in assessment:

1. Assessment begins with an analysis of criterion. The identification of criterion


includes the expectations and demands and other forms of learning targets (goals, objectives,
expectations, etc.).

2. It proceeds to the determination of the kind of evidence that is appropriate about the
individuals who are placed in the learning environment such as their relevant strengths and
weaknesses, skills, and abilities.

In the classroom context, it was explained in Lesson 1 that assessment takes place before,
during, and after instruction. This process emphasizes that assessment is embedded in the
teaching and learning process. Assessment generally starts in the planning of learning processes
when learning objectives are stated. A learning objective is defined in measurable terms to have
an empirical way of testing it. Specific behaviors are stated in the objectives so that they
correspond with some form of assessment. Assessment can also occur during the implementation
of the lesson: a teacher may provide feedback based on student recitations, exercises, short
quizzes, and classroom activities that allow students to demonstrate the skill intended in the
objectives.
The assessment done during instruction should be consistent with the skills required in the
objectives of the lesson. The final assessment is then conducted after enough assessment can
demonstrate student mastery of the lesson and their skills. The final assessment conducted can be
the basis for the succeeding objectives for the next lesson. The figure below illustrates the
process of assessment.

Figure 1
The Process of Assessment in the Teaching and Learning Context

[Diagram: Learning Objectives → Learning Experience → Assessment, with assessment taking
place at each stage of the cycle]
27

Forms of Assessment

Assessment comes in different forms. It can be classified as qualitative or quantitative,


structured or unstructured, and objective or subjective.

Quantitative and Qualitative

Assessment is not limited to quantitative values; it can also be qualitative.


Examples of qualitative assessments are anecdotal records, written reports, written observations
in narrative forms. Qualitative assessments provide a narrative description of attributes of
students, such as their strengths and weaknesses, areas that need to be improved and specific
incidents that support areas of strengths and weaknesses. Quantitative values use numbers to
represent attributes. The advantages of quantification were described in Lesson 2. Quantitative
values as results in assessment facilitate accurate interpretation. Assessment can be a
combination of both qualitative and quantitative results.

Structured vs. Unstructured

Assessment can come in the form of structured or unstructured way of gathering data.
Structured forms of assessment are controlled, formal, and involve careful planning and
organized implementation. An example of a formal assessment is a final exam that is announced
in advance, for which students are provided with enough time to study, the coverage is specified,
and the test items are reviewed. A formal graded recitation can be a structured form of assessment when
it is announced, questions are prepared, and students are informed of the way they are graded in
their answers. Unstructured assessment can be informal in terms of its processes. An example
would be a short unannounced quiz just to check if students have remembered the past lesson,
informal recitations during discussion, and assignments arising from the discussion.

Objective vs. Subjective

Assessment can be objective or subjective. Objective assessment has less variation in


results such as objective tests, seatworks, and performance assessment with rubrics with right
and wrong answers. Subjective assessment, on the other hand, results in larger variation in
results, such as essays and reaction papers. Careful procedures should be undertaken as much as
possible to ensure objectivity in assessing essays and reaction papers.

Components of Classroom Assessment

Tests

Tests are basically tools that measure a sample of behavior. Generally there are a variety
of tests provided inside the classroom. They can be in the form of quizzes, long tests (usually covering
smaller units or chapters of a lesson), and final exams. The majority of the tests for students are
teacher-made-tests. These tests are tailored for students depending on the lesson covered by the
syllabus. The tests are usually checked by colleagues to ensure that items are properly
constructed.
28

Teacher-made tests vary in the form of a unit, chapter, or long test. These generally assess
how much a student learned within a unit or chapter. It is a summative test in such a way that it is
given after instruction. The coverage is only what has been taught in a given chapter or tackled
within a given unit.
Tests also come in the form of a quiz. It is a short form assessment. It usually measures
how much the student acquired within a given period or class. The questions are usually from
what has been taught within the lesson for the day or topic tackled in a short period of time, say
for a week. A quiz can be summative or formative: summative if it aims to measure the learning
from an instruction, or formative if it aims to test how much the students already know prior to
instruction. The results of a quiz can be used by the teacher to
know where to start the lesson (example, the students already know how to add single digits, and
then she can already proceed to adding double digits). It can also determine if the objectives for
the day are met.

Recitation

A recitation is the verbal way of assessing students’ expression of their answers to some
stimuli provided in the instruction or by the teacher. It is a kind of assessment in which oral
participation of the student is expected. It serves many functions: before instruction, it can be used
to probe the prior knowledge of the students about the topic. It can also be done during instruction,
wherein the teacher solicits ideas from the class regarding the topic. It can also be done after
instruction to assess how much the student learned from the lesson for the day.
Recitations are facilitated by questions provided by the teacher and are meant to make
students think in order to answer them. There are many purposes of
recitation. A recitation is given if teachers want to assess whether students can recall facts
events from the previous lesson. A recitation can be done to check whether a student understands
the lesson, or can go further into higher cognitive skills. Measuring higher order cognitive skills
during recitation will depend on the kind of question that the teacher provides. Appraising a
recitation can be structured or unstructured. Some teachers announce the recitation and the
coverage beforehand to allow students to prepare. The questions are prepared, and a system of
scoring the answers is provided as well. Informal recitations are simply noted by the teacher.
Effective recitations inside the classroom are marked by all students having an equal chance of
being called. Some concerns of teachers regarding the recitation process are as follows:

Should the teacher call more on the students who are silent most of the time in class?
Should the teacher ask students who could not comprehend the lesson easily more often?
Should recitation be a surprise?
Are the difficult questions addressed to disruptive students?
Are easy questions only for students who are not performing well in class?

Projects

Projects can come in a variety of forms depending on the objectives of the lesson; a
reaction paper, a drawing, or a class demonstration can all be considered projects depending on
the purpose. The features of a project should include: (1) tasks that are relevant in real-life
settings, (2) higher order cognitive skills, (3) assessment and demonstration of affective
29

and psychomotor skills which supplement instruction, and (4) application of the theories taught
in class.

Performance Assessment

Performance assessment is a form of assessment that requires students to perform a task


rather than select an answer from a ready-made list. Examples would be students demonstrating
their skill in communication through a presentation, building a diorama, or performing a dance
number showing different stunts in a physical education class. Performance assessment can be in the
form of an extended-response exercise, extended tasks, and portfolios. Extended-response
exercises are usually open-ended where students are asked to report their insights on an issue,
their reactions to a film, and opinions on an event. Extended tasks are more precise, requiring
focused skills and time like writing an essay, composing a poem, planning and creating a script
for a play, painting a vase. These tasks are usually extended as an assignment if the time in
school is not sufficient. Portfolios are collections of students’ works. For an art class the students
will compile all paintings made, for a music class all compositions are collected, for a drafting
class all drawings are compiled. Table 4 shows the different tasks using performance assessment.

Table 4
Outcomes Requiring Performance Assessment

Outcome Behavior
Skills Speaking, writing, listening, oral reading, performing experiments, drawing,
playing a musical instrument, gymnastics, work skills, study skills, and social
skills
Work habits Effectiveness in planning, use of time, use of equipment resources, the
demonstration of such traits as initiative, creativity, persistence, dependability
Social Concern for the welfare of others, respect for laws, respect for the property of
attitudes others, sensitivity to social issues, concern for social institutions, desire to work
toward social improvement
Scientific Open-mindedness, willingness to suspend judgment, cause-effect relations, an
attitudes inquiring mind
Interests Expressing feelings toward various educational, mechanical, aesthetic, scientific,
social, recreational, vocational activities
Appreciations Feeling of satisfaction and enjoyment expressed toward music, art, literature,
physical skill, outstanding social contributions
Adjustments Relationship to peers, reaction to praise and criticism, reaction to authority,
emotional stability, social adaptability

Assignments

Assignment is a kind of assessment which extends classroom work. It is usually a take


home task which the student completes. It may vary from reading a material, problem solving,
research, and other tasks that are accomplishable in a given time. Assignments are used to
supplement a learning task or preparation for the next lesson.
30

Assignments are meant to reinforce what is taught inside the classroom. Tasks on the
assignment are specified during instruction and students carry out these tasks outside of the
school. When the student comes back, the assignment should have helped the student learn the
lesson better.

Paradigm Shifts in the Practice of Assessment

Over the years, the practice of assessment has changed due to improvements in
teaching and learning principles. These principles are a result of research that called for more
information on how learning takes place. The shift is shown from old practices to what should be
ideal in the classroom.

From To

Testing Alternative assessment

Paper and pencil Performance assessment

Multiple choice Supply

Single correct answer Many correct answers

Summative Formative

Outcome only Process and Outcome

Skill focused Task-based

Isolated facts Application of knowledge

Decontextualized task Contextualized task

External Evaluator Student self-evaluation

Outcome oriented Process and outcome

The old practice of assessment focuses on traditional forms of assessment such as paper
and pencil tests with a single correct answer, usually conducted at the end of the lesson. For the
contemporary perspectives in assessment, assessment is not necessarily in the form of paper and
pencil tests because there are skills that are better captured through performance assessment
such as presentations, psychomotor tasks, and demonstrations. Contemporary practice welcomes
a variety of answers from students where they are allowed to make interpretation of their own
learning. It is now accepted that assessment is conducted concurrently with instruction and not
only serving a summative function. There is also a shift toward assessment items that are
contextualized and have more utility. Rather than asking for the definitions of verbs, nouns,
and pronouns, students are required to make an oral or written communication about their
31

favorite book. It is also important that students assess their own performance to facilitate self-
monitoring and self-evaluation.

Activity:

Conduct a simple survey and administer to teachers the questionnaire:

Gender: ___ Male ____ Female Years of teaching experience: ________


Subject currently handled: ____________________

Always Often Sometimes Rarely Never


1. My students collect their works in a portfolio.
2. I look at both the process and the final work in
assessing students' tasks.
3. I welcome varied answers among my students
during recitation.
4. I announce the criteria to my students on how they
are graded in their work.
5. I provide feedback on my students' performance
often.
6. I use performance assessment when paper and
pencil test are not appropriate.
7. I use other forms of informal assessment.
8. The students’ final grade in my course is based on
multiple assessment.
9. The students grade their group members during a
group activity aside from the grade I give.
10. I believe that my students’ grades are not
conclusive.

Uses of Assessment

Assessment results have a variety of applications, from selection to appraisal to aiding
in the decision-making process. These functions of assessment vary within the educational
setting, whether assessment is conducted for human resources, counseling, instruction, research, or
learning.

1. Appraising

Assessment is used for appraisal. Forms of appraisals are the grades, scores, rating, and
feedback. Appraisals are used to provide feedback on an individual’s performance to determine
how much improvement could be done. A low appraisal or negative feedback indicates that
32

performance still has room for improvement, while high appraisal or positive feedback means
that performance needs to be maintained.

2. Clarifying Instructional Objectives

Assessment results are used to improve the succeeding lessons. Assessment results point
out if objectives are met for a specific lesson. The outcome of the assessment results are used by
teachers in their planning for the next lesson. If teachers found out that majority of students
failed in a test or quiz, then the teacher assesses whether the objectives are too high or may not
be appropriate for students’ cognitive development. Objectives are then reformulated to
approximate students’ ability and performance that is within their developmental stage.
Assessment results also have implications to the objectives of the succeeding lessons. Since the
teacher is able to determine the students’ performance and difficulties, the teacher improves the
necessary intervention to address them. The teacher being able to address the deficiencies of
students based on assessment results is reflective of effective teaching performance.

3. Determining and reporting pupil achievement of education objectives

The basic function of assessment is to determine students’ grades and report their scores
after major tests. The reported grade communicates students’ performance to many stakeholders
such as teachers, parents, guidance counselors, administrators, and other concerned
personnel. The reported standing of students in their learning shows how much they have attained
the instructional objectives set for them. The grade is a reflection of how much they have
accomplished the learning goals.

4. Planning, directing, and improving learning experiences

Assessment results are the basis for improvement in the implementation of instruction.


Assessment results from students serve as a feedback on the effectiveness of the instruction or
the learning experience provided by the teacher. If the majority of students have not mastered the
lesson, the teacher needs to come up with more effective instruction to target mastery for all the
students.

5. Accountability and program evaluation

Assessment results are used for evaluation and accountability. In making judgments
about individuals or educational programs multiple assessment information is used. Results of
evaluations make the administrators or the ones who implemented the program accountable to
the stakeholders and other recipients of the program. This accountability ensures that the
program implementation is improved depending on the recommendations from the
evaluations conducted. Improvement takes place if assessment coincides with accountability.

6. Counseling

Counseling also uses a variety of assessment results. Variables such as study habits,
attention, personality, and dispositions are assessed in order to help students improve them.
33

Students who are assessed to be easily distracted inside the classroom can be helped by the
school counselor by focusing the counseling session in devising ways to improve the attention of
a student. A student who is assessed to have difficulties in classroom tasks is taught to self-
regulate during the counseling session. Students’ personality and vocational interests are also
assessed to guide them in the future courses suitable for them to take.

7. Selecting

Assessment is conducted in order to select students for the honor roll or pilot
sections. Assessment is also conducted to select from among student enrollees those who will be
accepted in a school, college, or university. Recipients of scholarships and other grants are also
selected based on assessment results.

Guide Questions:

1. What are the other uses of Assessment?


2. What major decisions in the educational setting need to be backed up by assessment results?
3. What are the things assessed in your school aside from selection of students and reporting
grades?

References

Bloom, B. (1970). Toward a theory of testing which include measurement-assessment-


evaluation. In M. C. Wittrock, and D. E Wiley (Eds.), The evaluation of instruction: Issues and
problems (pp. 25-69). New York: Holt, Rinehart, & Winston.

Chen, H. (2005). Practical program evaluation. Beverly Hills, CA: Sage.

Fitzpatrick, J. L., Sanders, J. R., & Worthen, B. R. (2004). Program evaluation: Alternative
approaches and practical guidelines (3rd ed.). New York: Pearson.

Gronlund, N. E. (1993). How to write achievement tests and assessment (5th ed.). Needham
Heights: Allyn & Bacon.

Huba, M. E. & Freed, J. E. (2000). Learner-Centered Assessment on College Campuses -


Shifting the Focus from Teaching to Learning. Boston: Allyn and Bacon.

Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation
standards (2nd ed.). Thousand Oaks, CA: Sage.

Magno, C. (2007). Program evaluation of the civic welfare training services (Tech Rep. No. 3).
Manila, Philippines: De La Salle-College of Saint Benilde, Center for Learning and Performance
Assessment.
34

McMillan, J. H. (2001). Classroom assessment: Principles and practice for effective instruction.
Boston: Allyn & Bacon.

Nunnally, J. C. (1970). Introduction to psychological measurement. New York: McGraw-Hill.

Popham, W. J. (1998). Classroom assessment: What teachers need to know (2nd ed.). Needham
Heights, MA: Allyn & Bacon.

Scriven, M. (1967). The methodology of evaluation: Perspectives of curriculum evaluation.


Chicago: Rand McNally.
35

Chapter 2
The Learning Intents

Chapter Objectives

1. Describe frameworks of the various taxonomic tools.


2. Compare and contrast the various taxonomic tools for setting the learning intents.
3. Justify the use of taxonomic tools in assessment planning.
4. Formulate appropriate learning intents.
5. Use the taxonomic tools in formulating the learning intents.
6. Evaluate the learning intents on the basis of the taxonomic framework in use.

Lessons

1 The Conventional Taxonomic Tools


Bloom’s Taxonomy
The Revised Taxonomy
2 The Alternative Taxonomic Tools
Gagne’s taxonomic guide
Stiggins & Conklin’s taxonomic categories
The New Taxonomy
The Thinking Hats
3 Specificity of the Learning Intents
36

Lesson 1: The Taxonomic Tools

Having learned about measurement, assessment, and evaluation, this chapter will bring
you to the discussion on the learning intents, which refer to the objectives or targets the teacher
sets as the competency to build on the students. This is the target skill or capacity that you want
students to develop as they engage in the learning episodes. The same competency is what you
will soon assess using relevant tools to generate quantitative and qualitative information about
your students’ learning behavior.
Prior to designing your learning activities and assessment tasks, you first have to
formulate your learning intents. These intents exemplify the competency you wish students will
develop in themselves. At this point, your deep understanding of how learning intents should be
formulated is very useful. As you go through this chapter, your knowledge about the guidelines
in formulating these learning intents will help you understand how assessment tasks should be
defined.
In formulating learning intents, it is helpful to be aware that appropriate targets of
learning come in different forms because learning environments differ in many ways. What is
crucial is the identification of which intents are more important than the others so that they are
given appropriate priority. When you formulate statements of learning intents, it is important that
you have a strong grasp of some theories of learning as these will aid you in determining what
competency could possibly be developed in the students. If you are familiar with Bloom’s
taxonomy, refresh your understanding of it so that you can make good use
of it.

Figure 1
Bloom’s Taxonomy

[Diagram of the six levels of the cognitive domain, from least to most complex: Knowledge,
Comprehension, Application, Analysis, Synthesis, Evaluation]
37

Figure 1 shows a guide for teachers in stating learning intents based on six dimensions of
cognitive process. Knowledge, the level whose degree of complexity is lowest, includes
simple cognitive activity such as recall or recognition of information. The cognitive activity in
comprehension includes understanding of the information and concepts, translating them into
other forms of communication without altering the original sense, interpreting, and drawing
conclusions from them. For application, emphasis is on students’ ability to use previously
acquired information and understanding, and other prior knowledge in new settings and applied
contexts that are different from those in which it was learned. For learning intents stated at the
Analysis level, tasks require identification and connection of logic, and differentiation of
concepts based on logical sequence and contradictions. Learning intents written at this level
indicate behaviors that indicate ability to differentiate among information, opinions, and
inferences. Learning intents at the synthesis level are stated in ways that indicate students’
ability to produce a meaningful and original whole out of the available information,
understanding, contexts, and logical connections. Evaluation includes students’ ability to make
judgments and sound decisions based on defensible criteria. Judgments include the worth, relevance,
and value of some information, ideas, concepts, theories, rules, methods, opinions, or products.
Comprehension requires knowledge as information is required in understanding it. A
good understanding of information can facilitate its application. Analysis requires the first three
cognitive activities. Both synthesis and evaluation require knowledge, comprehension,
application, and analysis. Evaluation does not require synthesis, and synthesis does not require
evaluation either.
Some 45 years after the birth of Bloom’s original taxonomy, a revised version developed
by Anderson and Krathwohl has come into teaching practice. Statements
that describe intended learning outcomes as a result of instruction are framed in terms of some
subject matter content and the action required with the content. To eliminate the anomaly of
unidimensionality of the statement of learning intents in their use of noun phrases and verbs
altogether, Figure 3 shows two separate dimensions of learning: the knowledge dimension and
the cognitive process dimension.
Knowledge Dimension has four categories, three of which include the subcategories of
knowledge in the original taxonomy. The fourth, however, is a new one, something that was not
yet gaining massive popularity at the time when the original taxonomy was conceived. It is new
and, at the same time, important in that it includes strategic knowledge, knowledge about
cognitive tasks, and self-knowledge.
Factual knowledge. This includes knowledge of specific information, its details and other
elements therein. Students make use of this knowledge to become familiar with the subject matter or
propose solutions to problems within the discipline.
Conceptual knowledge. This includes knowledge about the connectedness of information
and other elements to a larger structure of thought so that a holistic view of the subject matter or
discipline is formed. Students classify, categorize, or generalize ideas into meaningful structures
and models.
38

Procedural knowledge. This category includes the knowledge of doing some


procedural tasks that require specific skills and methods. Students also know the criteria for using
the procedures in levels of appropriateness.
Metacognitive knowledge. This involves cognition in general as well as the awareness and
knowledge of one’s own cognition. Students know how they are thinking and become aware of
the contexts and conditions within which they are learning.

Figure 3. Sample Objectives Using the Revised Taxonomy

                 Remember    Understand    Apply    Analyze    Evaluate    Create
Factual          #1
Conceptual                   #2                                #3
Procedural
Metacognitive    #4

# 1: Remember the characters of the story, “Family Adventure.”
# 2: Compare the roles of at least three characters of the story.
# 3: Evaluate the story according to specific criteria.
# 4: Recall personal strategies used in understanding the story.

Cognitive Process Dimension is where specific behaviors are pegged, using active
verbs. However, so that there is consistency in the description of specific learning behaviors, the
categories in the original taxonomies which were labeled in noun forms are now replaced with
their verb counterparts. Synthesis changed places with Evaluation, and both are now stated in verb
forms.
Remember. This includes recalling and recognizing relevant knowledge from long-term
memory.
Understand. This is the determination of the meanings of messages from oral, written or
graphic sources.
Apply. This involves carrying out procedural tasks, executing or implementing them in
particular realistic contexts.
Analyze. This includes breaking down concepts into clusters or chunks of ideas and
meaningfully relating them to one another and to other dimensions.
Evaluate. This is making judgments relative to clear standards or defensible criteria to
critically check for depth, consistency, relevance, acceptability, and other areas.
39

Create. This includes putting together some ideas, concepts, information, and other
elements to produce complex and original, but meaningful whole as an outcome.
The use of the revised taxonomy in different programs has benefited both teachers and
students in many ways (Ferguson, 2002; Byrd, 2002). The benefits generally come from the fact
that the revised taxonomy provides clear dimensions of knowledge and cognitive processes in
which to focus in the instructional plan. It also allows teachers to set targets for metacognition
concurrently with other knowledge dimensions, which is difficult to do with the old taxonomy.

Lesson 2: The Alternative Taxonomic Tools

Bloom’s taxonomy and the revised taxonomy are not the only existing
taxonomic tools for setting our instructional targets. There are other equally useful taxonomies.
One of these was developed by Robert M. Gagne. In his theory of instruction, Gagne seeks to
help teachers make sound educational decisions so that the probability that the desired results in
learning are achieved is high. These decisions necessitate the setting of intentional goals that
assure learning.
In stating learning intents using Gagne’s taxonomy, we can focus on three domains. The
cognitive domain includes Declarative (verbal information), Procedural (intellectual skills), and
Conditional (cognitive strategies) knowledge. The psychological domain includes affective
knowledge (attitudes). The psychomotor domain involves the use of physical movement (motor
skills).
Verbal Information includes a vast body of organized knowledge that students acquire through
formal instructional processes and other media, such as television. Students
understand the meaning of concepts rather than just memorizing them. This condition of learning
lumps together the first two cognitive categories of Bloom’s taxonomy. Learning intents must
focus on differentiation of contents in texts and other modes of communication; chunking the
information according to meaningful subsets; remembering and organizing information.
Intellectual Skills include procedural knowledge that ranges from Discrimination, to Concrete
Concepts, to Defined Concepts, to Rules, and to Higher Order Rules.
Discrimination involves the ability to distinguish objects, features, or symbols. Detection
of difference does not require naming or explanation.
Concrete Concepts involve the identification of classes of objects, features, or events,
such as differentiating objects according to concrete features, such as shape.
Defined Concepts include classifying new and contextual examples of ideas, concepts, or
events by their definitions. Here, students make use of labels of terms denoting defined
concepts for certain events or conditions.
Rules apply a single relationship to solve a group of problems. The problem to be solved
is simple, requiring conformance to only one simple rule.
40

Higher order rules include the application of a combination of rules to solve a complex
problem. The problem to be solved requires the use of complex formula or rules so that
meaningful answers are arrived at.
Learning intents stated at this level of the cognitive domain must give attention to abilities
to spot distinctive features, use information from memory to respond to intellectual tasks in
various contexts, make connections between concepts and relate them to appropriate situations.
Cognitive Strategies consist of a number of ways to make students develop skills in guiding and
directing their own thinking, actions, feelings, and their learning process as a whole. Students
create and hone their metacognitive strategies. These processes help them regulate and oversee
their own learning, and consist of planning and monitoring their cognitive activities, as well as
checking the outcomes of those activities. Learning intents should emphasize abilities to describe
and demonstrate original and creative strategies that students have tried out in various conditions.
Attitudes are internal states of being that are acquired through earlier experience of task
engagement. These states influence the choice of personal response to things, events, persons,
opinions, concepts, and theories. Statements of learning intents must establish a degree of
success associated with desired attitude, call for demonstration of personal choice for actions and
resources, and allow observation of real-world and human contexts.
Motor Skills are well defined, precise, smooth and accurately timed execution of performances
involving the use of the body parts. Some cognitive skills are required for the proper execution
of motor activities. Learning intents drawn at this domain should focus on the execution of fine
and well-coordinated movements and actions relative to the use of known information, with
acceptable degree of mastery and accuracy of performance.
Another taxonomic tool is one developed by Stiggins & Conklin (1992), which involves
categories of learning as bases in stating learning intents.

Knowledge This includes simple understanding and mastery of a great deal of subject matter,
processes, and procedures. Very fundamental to the succeeding stages of learning
is the knowledge and simple understanding of the subject matter. This learning
may take the form of remembering facts, figures, events, and other pertinent
information, or describing, explaining, and summarizing concepts, and citing examples.
Learning intents must endeavor to develop mastery of facts and information as
well as simple understanding and comprehension of them.

Reasoning This indicates ability to use deep knowledge of subject matter and procedures to
reason defensibly and solve problems with efficiency. Tasks under this
category include critical and creative thinking, problem solving, making
judgments and decisions, and other higher order thinking skills. Learning intents
must, therefore, focus on the use of knowledge and simple understanding of
information and concepts to reason and solve problems in contexts.

Skills This highlights the ability to demonstrate skills to perform tasks with acceptable
degree of mastery and adeptness. Skills involve overt behaviors that show
knowledge and deep understanding. For this category, learning intents have to
41

take particular interest in the demonstration of overt behaviors or skills in actual


performance that requires procedural knowledge and reasoning.

Products In this area, the ability to create and produce outputs for submission or oral
presentations is given importance. Because outputs generally represent mastery of
knowledge, deep understanding, and skills, they must be considered as products
that demonstrate the ability to use those knowledge and deep understanding, and
employ skills in strategic manner so that tangible products are created. For the
statement of learning intents, teachers must state expected outcomes, either
process- or product-oriented.

Affect Focus is on the development of values, interests, motivation, attitudes, self-


regulation, and other affective states. In stating learning intents on this category, it
is important that clear indicators of affective behavior can easily be drawn from
the expected learning tasks. Although many teachers find it difficult to determine
indicators of affective learning, it is inspiring to realize that it is not impossible
to assess it.

These categories of learning by Stiggins and Conklin are helpful especially if your intents
focus on complex intellectual skills and the use of these skills in producing outcomes to increase
self-efficacy among students. In attempting to formulate statements of learning outcome at any
category, you can be clear about what performance you want to see at the end of the instruction.
In terms of assessment, you would know exactly what to do and what tools to use in assessing
learning behaviors based on the expected performance. Although stating learning outcomes at
the affective category is not as easy to do as in the knowledge and skill categories, trying it
can help you approximate the degree of engagement and motivation required to perform what is
expected. Or if you would like to also give prominence to this category without stating another
learning intent that particularly focuses on the affective states, you might just look for some
indicators in the cognitive intents. This is possible because knowledge, skills, and attitudes are
embedded in every single statement of learning intent.
Another alternative guide for setting the learning targets is one that has been introduced
to us by Robert J. Marzano in his Dimensions of Learning (DOL). As a taxonomic tool, the DOL
provides a framework for assessing various types of knowledge as well as different aspects of
processing which comprises six levels of learning in a taxonomic model called the new taxonomy
(Marzano & Kendall, 2007). These levels of learning are categorized into different systems.

The Cognitive System


The cognitive system includes those cognitive processes that effectively use or
manipulate information, mental procedures and psychomotor procedures in order to successfully
complete a task. It indicates the first four levels of learning, such as:
Level 1: Retrieval. In this level of the cognitive system students engage some mental
operations for recognition and retrieval of information, mental procedure, or psychomotor
procedure. Students engage in recognizing, where they identify the characteristics, attributes,
qualities, aspects, or elements of information, mental procedure, or psychomotor procedure;
42

recalling, where they remember relevant features of information, mental procedure, or


psychomotor procedure; or executing, where they carry out a specific mental or psychomotor
procedure. Neither the understanding of the structure and value of information nor the how’s and
why’s of the mental or psychomotor procedure is necessary.
Level 2: Comprehension. As the second level of the cognitive system, comprehension
includes students’ ability to represent and organize information, mental procedure or
psychomotor procedure. It involves symbolizing where students create symbolic representation
of the information, concept, or procedures with a clear differentiation of its critical and
noncritical aspects; or integrating, where they put together pieces of information into a
meaningful structure of knowledge or procedure, and identify its critical and noncritical aspects.
Level 3: Analysis. This level of the cognitive system includes more manipulation of
information, mental procedure, or psychomotor procedure. Here students engage in analyzing
errors, where they spot errors in the information, mental procedure, or psychomotor procedure,
and in its use; classifying the information or procedures into general categories and their
subcategories; generalizing by formulating new principles or generalizations based on the
information, concept, mental procedure, or psychomotor procedure; matching components of
knowledge by identifying important similarities and differences between the components; and
specifying applications or logical consequences of the knowledge in terms of what predictions
can be made and proven about the information, mental procedure, or psychomotor procedure.
Level 4: Knowledge Utilization. The optimal level of cognitive system involves
appropriate use of knowledge. At this level, students put the information, mental procedure, or
psychomotor procedure to appropriate use in various contexts. It allows for investigating a
phenomenon using certain information or procedures, or investigating the information or
procedure itself; using information or procedures in experimenting with knowledge in order to test
hypotheses, or generating hypotheses from the information or procedures; problem solving,
where students use the knowledge to solve a problem, or solving a problem about the knowledge
itself; and decision making, where the use of information or procedures help arrive at a decision,
or decision is made about the knowledge itself.

The Metacognitive System


The metacognitive system involves students’ personal agency of setting appropriate goals
of their learning and monitoring how they go through the learning process. Being the 5th level of
the new taxonomy, the metacognitive system includes such learning targets as specifying goals,
where students set goals in learning the information or procedures, and make a plan of action for
achieving those goals; process monitoring, where students monitor how they go about the action
they decided to take, and find out if the action taken effectively serves their plan for learning the
information or procedures; clarity monitoring, where students determine how much clarity has
been achieved about the knowledge in focus; and accuracy monitoring, where students see how
accurately they have learned about the information or procedures.

The Self System


Placed at the highest level in the new taxonomy, the Self System is the level of learning
that sustains students’ engagement by activating some motivational resources, such as their self-
43

beliefs in terms of personal competence and the value of the task, emotions, and achievement-
related goals. At this level, students reason about their motivational experiences. They reason
about the value of knowledge by examining importance of the information or procedures in their
personal lives; about their perceived competence by examining efficacy in learning the
information or procedures; about their affective experience in learning by examining emotional
response to the knowledge under study; about their overall engagement by examining motivation
in learning the information or procedures.
In each system, three dimensions of knowledge are involved, such as information, mental
procedures, and psychomotor procedures.

Information
The domain of informational knowledge involves various types of declarative knowledge
that are ordered according to levels of complexity. From its most basic to more complex levels, it
includes vocabulary knowledge in which meaning of words are understood; factual knowledge,
in which information constituting the characteristics of specific facts are understood; knowledge
of time sequences, where understanding of important events between certain time points is
obtained; knowledge of generalizations of information, where pieces of information are
understood in terms of their warranted abstractions; and knowledge of principles, in which causal
or correlational relationships of information are understood. The first three types of
informational knowledge focus on knowledge of informational details, while the next two types
focus on informational organization.

Mental Procedures
The domain of mental procedures involves those types of procedural knowledge that
make use of the cognitive processes in a special way. In its hierarchic structure, mental
procedures could be as simple as the use of single rule in which production is guided by a small
set of rules that requires a single action. If single rules are combined into general rules and are
used in order to carry out an action, the mental procedures are already of tactical type, or an
algorithm, especially if specific steps are set for specific outcomes. The macroprocedures is on
top of the hierarchy of mental procedures, which involves execution of multiple interrelated
processes and procedures.

Psychomotor Procedures
The domain of psychomotor procedures involves those physical procedures for
completing a task. In the new taxonomy, psychomotor procedures are considered a dimension of
knowledge because, very similar to mental procedures, they are regulated by the memory system
and develop in a sequence from information to practice, then to automaticity (Marzano &
Kendall, 2007).
In summary, the new taxonomy of Marzano and Kendall (2007) provides us with a
multidimensional taxonomy where each system of thinking comprises three dimensions of
knowledge that will guide us in setting learning targets for our classrooms. Table 2a shows the
matrix of the thinking systems and dimensions of knowledge.
Table 2a
Systems of Thinking Crossed with the Dimensions of Knowledge (Information, Mental Procedure, Psychomotor Procedure)

Level 6 (Self System)
Level 5 (Metacognitive System)
Level 4: Knowledge Utilization (Cognitive System)
Level 3: Analysis (Cognitive System)
Level 2: Comprehension (Cognitive System)
Level 1: Retrieval (Cognitive System)

Now, if you wish to explore other alternative tools for setting your learning objectives,
another aid for targeting the more complex learning outcomes comes from Edward de Bono
(1985). There are six thinking hats, each named for a color that represents a specific
perspective. When these hats are "worn" by the student, information, issues, concepts, theories,
and principles are viewed from the perspective mnemonically associated with each hat. Suppose
your learning intent requires students to mentally put on the white hat, whose mental processes
include gathering information and thinking about how it can be obtained, and whose emotional
state is neutral; the learning behaviors may then include classifying facts and opinions, among
others. It is essential to be conscious that each hat represents a particular perspective involving
a frame of mind as well as an emotional state. Therefore, the perspective held by the students
when a hat is mentally worn is a composite of mental and emotional states. Below is a summary
of the six thinking hats.

THE HATS

White Hat. Perspective: observer. Representation: white paper, neutral. Descriptive behavior: looking for needed objective facts and information, including how these can be obtained.

Red Hat. Perspective: self and others. Representation: fire, warmth. Descriptive behavior: presenting views, feelings, emotions, and intuition without explanation or justification.

Black Hat. Perspective: self and others. Representation: a stern judge wearing a black robe. Descriptive behavior: judging with a logical negative view, looking for wrongs and playing the devil's advocate.

Yellow Hat. Perspective: self and others. Representation: sunshine, optimism. Descriptive behavior: looking for benefits and productivity with a logical positive view, seeing what is good in anything.

Green Hat. Perspective: self and others. Representation: vegetation. Descriptive behavior: exploring possibilities and making hypotheses, composing new ideas with creative thinking.

Blue Hat. Perspective: observer. Representation: sky, cool. Descriptive behavior: establishing control of the process of thinking and engagement, using metacognition.

Figure 5
Summative map of the Six Thinking Hats

These six thinking hats are beneficial not only in our teaching episodes but also in the
learning intents that we set for our students. If qualities of thinking, creative thinking,
communication, decision-making, and metacognition are some of those that you want to develop
in your students, these six thinking hats could help you formulate statements of learning intents
that clearly set the direction of learning. An added benefit is that when your intents are stated
along the lines of these hats, the learning episodes can be defined easily. Consequently,
assessment is made more meaningful.

A. Formulate statements of learning intent using the Revised Taxonomy, focusing on any
category of the knowledge dimension but on the higher categories of the cognitive dimension.
B. Bring those statements of learning intent to Robert Gagne's taxonomy and see where they
will fit. You may customize the statements a bit so that they fit well into any of Gagne's
categories of learning.
C. Do the same process of fitting to Stiggins' categories of learning, then the New
Taxonomy. Remember to customize the statements when necessary.
D. Draw insights from the process and share them in class.

Lesson 3: Specificity of the learning intent

Learning intents usually come in relatively specific statements of the desired learning
behavior or performance we would like to see in our students at the end of the instructional
process. To make these intents facilitate relevant assessment, it is important that they are stated
with active verbs, those that represent clear actions or behaviors, so that indicators of
performance are easily identified. These active verbs are an essential part of the statement of
learning intents because they specify what the students actually do within and at the end of a
specified period of time. In this case, assessment becomes convenient to do because it can
specifically focus on the indicated behaviors or actions.

Gronlund (in McMillan, 2005) uses the term instructional objectives to mean
intended learning outcomes. He emphasizes that instructional objectives should be
stated in terms of specific, observable, and measurable student responses.

In writing statements of learning intents for the courses we teach, we aim to state the behavior
outcomes to which our teaching efforts are devoted, so that, from these statements, we can
design specific tasks in the learning episodes for our students to engage in. However, we need
to make sure that these statements are set with the proper level of generality so that they
do not oversimplify or complicate the outcome.
A statement of intent could have a rather wide range of generality, so that many sub-
outcomes may be indicated. Learning intents that are stated in general terms will need to be
defined further by a sample of the specific types of student performance that characterize the
intent. In doing this, assessment will be easy because the performance is clearly defined. Unlike
the general statements of intent, which may permit the use of not-so-active verbs such as know,
comprehend, understand, and so on, the specific ones use active verbs in order to define specific
behaviors that will soon be assessed. The selection of these verbs is very vital in the preparation
of a good statement of learning intent. Three points to remember might help in selecting active verbs.
1. See that the verb clearly represents the desired learning intent.
2. Note that the verb precisely specifies acceptable performance of the student.
3. Make sure that the verb clearly describes relevant assessment to be made within or at the
end of the instruction.

The statement, students know the meaning of terms in science, is general. Although it
gives us an idea of the general direction of the class towards the expected outcome, we might
be confused as to what specific behaviors of knowing will be assessed. Therefore, it is necessary
that we draw a representative sample of specific learning intents so that we will let students:

• write a definition of a particular scientific term

• identify the synonym of the word

• give the term that fits a given description

• present an example of the term

• represent the term with a picture

• describe the derivation of the term

• identify symbols that represent the term

• match the term with concepts

• use the term in a sentence

• describe the relationship of terms

• differentiate between terms

• use the term in


If these behaviors are stated completely as specific statements of learning intent, we can
have a number of specific outcomes. To make specifically defined outcomes, the use of active
verbs is helpful. If more specificity is desired, statements of condition and criterion level can be
added to the learning intents. If you think that the statement, student can differentiate between
facts and opinions, needs more specificity, then you might want to add a condition so that it will
now sound like this:

Given a short selection, the student can identify statements of facts and of
opinions.
If more specificity is still desired, you might want to add a statement of
criterion level. This time, the statement may sound like this:
Given a short selection, the student can correctly identify at least 5 statements of
facts and 5 statements of opinion in no more than five minutes without the aid of
any resource materials.

The lesson plan may allow the use of moderately specific statements of learning intents,
with the condition and criterion level briefly stated. In doing assessment, however, these intents will
have to be broken down into their substantial details, such that the condition and criterion level are
specifically indicated. Note that it is not necessarily about choosing which statement is better
than the other. We can use both in planning our teaching. Take a look at this:

Learning Intent: Student will differentiate between facts and opinions from written texts.

Assessment: Given a short selection, the student can correctly identify at least 5
statements of facts and 5 statements of opinion in no more than five
minutes without the aid of any resource materials.

If you insert in the text the instructional activities or learning episodes in well described
manner as well as the materials needed (plus other entries specified in your context), you can
now have a simple lesson plan.

Should the statement of learning intent be stated in terms of


teacher performance or student performance that is to be
demonstrated after the instruction? How do these two differ
from each other?
Should it be stated in terms of the learning process or
learning outcome? How do these two differ from one
another?
Should it be subject-matter oriented or competency-oriented?

References:
Byrd, P. A. (2002). The revised taxonomy and prospective teachers. Theory into Practice, 41(4), 244.
Ferguson, C. (2002). Using the revised taxonomy to plan and deliver team-taught, integrated,
thematic units. Theory into Practice, 41(4), 238.
Marzano, R. J., & Kendall, J. S. (2007). The new taxonomy of educational objectives (2nd ed.).
Thousand Oaks, CA: Sage Publications.
Stiggins, R. J., & Conklin, N. F. (1992). In teachers' hands: Investigating the practices of
classroom assessment. Albany, NY: State University of New York Press.

Chapter 3
Characteristics of an Assessment Tool

Objectives

1. Determine the use of the different ways of establishing an assessment tool's validity and
reliability.
2. Become familiar with the different methods of establishing an assessment tool's validity and
reliability.
3. Assess how good an assessment tool is by determining its indices of validity, reliability,
item discrimination, and item difficulty.

Lessons

1 Reliability
Test-retest, split half, parallel forms, internal consistency, inter rater reliability

2 Validity
Content, Criterion-related, construct validity, divergent/convergent

3 Item Difficulty and Discrimination


Classical test theory approach: item analysis of difficulty and discrimination

4 Using a computer software in analyzing test items



Lesson 1
Reliability

What makes a good assessment tool? How does one know that a test is good enough to be used?
Educational assessment tools are judged by their ability to provide results that meet the needs of
users. For example, a good test provides accurate findings about a student's achievement if users
intend to determine achievement levels. The achievement results should also remain stable across
different conditions so that they can be used over longer periods of time.

Figure: Characteristics of a good assessment tool (reliable, valid, and able to discriminate traits)

A good assessment tool should be reliable, valid, and able to discriminate traits. You
have probably encountered several tests on the internet and in magazines that claim to tell
what kind of personality you have, your interests, and your dispositions. In order to determine
these characteristics accurately, the tests offered on the internet and in magazines should show
evidence that they are indeed valid and reliable. You need to be critical in selecting which test to
use and consider well whether these tests are indeed valid and reliable. There are several ways of
determining how reliable and valid an assessment tool is, depending on the nature of the variable
and the purpose of the test. These techniques use different statistical analyses, and this chapter
also provides the procedures for their computation and interpretation.
Reliability is the consistency of scores across conditions such as time, forms, items,
and raters. The consistency of results in an assessment tool is determined statistically using the
correlation coefficient. You can refer to the next section of this chapter to see how a
correlation coefficient is estimated. Each type of reliability will be explained in two ways:
conceptually and analytically.

Test-retest Reliability

Test-retest reliability is the consistency of scores when the same test is administered again on
another occasion. For example, in order to determine whether a spelling test is reliable, the same
spelling test is administered again to the same students at a different time. If the scores in
the spelling test across the two occasions are the same, then the test is reliable. Test-retest is a
measure of temporal stability since the test score is tested for consistency across a time gap. The
time gap between the two testing conditions can be within a week or a month, and generally it does not
exceed six months. Test-retest is more appropriate for variables that are stable, like psychomotor
skills (typing tests, block manipulation tests, grip strength), aptitude (spatial, discrimination,
visual rotation, syllogism, abstract reasoning, topology, figure-ground perception, surface
assembly, object assembly), and temperament (extraversion/introversion, thinking/feeling,
sensing/intuiting, judging/perceiving).
To analyze the test-retest reliability of an assessment tool, the first and second sets of
scores of a sample of test takers are correlated. The higher the correlation, the more reliable the
test is.

Procedure for Correlating Scores for the Test-Retest

Correlating two variables involves producing a linear relationship between the two sets of scores. For
example, a 50-item aptitude test was administered to 10 students at one time. It was then
administered again after two weeks to the same 10 students. The following scores were
produced:

Student Aptitude Test (Time 1) Aptitude Retest (Time 2)


A 45 47
B 30 33
C 20 25
D 15 19
E 26 28
F 20 23
G 35 38
H 26 29
I 10 15
J 27 29

In the data above, student A got a score of 45 during the first administration of the aptitude test
and, after two weeks, got a score of 47 on the same test. For student B, a score of 30 was
obtained on the first occasion and 33 after two weeks. The same goes for students C, D, E, F, G,
H, I, and J. The scores on the test at time 1 and the retest at time 2 are plotted in a graph called a
scatterplot, shown below. The projected straight line is called a regression line. The closer the
plots are to the regression line, the stronger the relationship between the test and retest scores. If
their relationship is strong, then the test scores are consistent and can be interpreted as reliable.
To estimate the strength of the relationship, a correlation coefficient needs to be obtained. The
correlation coefficient gives information about the magnitude, strength, significance, and
variance of the relationship of two variables.
Scatterplot of Aptitude Retest (Time 2) against Aptitude Test (Time 1)
Fitted regression line: Aptitude Retest (Time 2) = 5.2727 + 0.9184x

Different types of correlation coefficients are used depending on the level of measurement of a
variable. Levels of measurement can be nominal, ordinal, interval, and ratio. More information
about the levels of measurement is explained in the beginning chapters of any statistics book.
Most commonly, assessment data are on an interval scale. For interval and ratio or continuous
variables, the statistic that estimates the correlation coefficient is the Pearson Product Moment
correlation, or r. The r is computed using the formula:

r = [NΣXY − (ΣX)(ΣY)] / √{[NΣX² − (ΣX)²][NΣY² − (ΣY)²]}

Where

r = correlation coefficient
N = number of cases (respondents, examinees)
ΣXY = summation of the product of X and Y
ΣX = summation of the first set of scores designated as X
ΣY = summation of the second set of scores designated as Y
ΣX2 = sum of squares of the first set of scores
ΣY2 = sum of squares of the second set of scores

To obtain the values of ΣX, ΣY, ΣXY, ΣX², and ΣY², a table is set up.

Student   Aptitude Test (Time 1) X   Aptitude Retest (Time 2) Y   XY   X²   Y²
A 45 47 2115 2025 2209
B 30 33 990 900 1089
C 20 25 500 400 625
D 15 19 285 225 361
E 26 28 728 676 784
F 20 23 460 400 529
G 35 38 1330 1225 1444
H 26 29 754 676 841
I 10 15 150 100 225
J 27 29 783 729 841
ΣX=254 ΣY=286 ΣXY =8095 ΣX2 =7356 ΣY2 =8948

To obtain the value of 2115 in the fourth column (XY), simply multiply 45 and 47; 2025 in the fifth
column is obtained by squaring 45 (45² or 45 × 45); and 2209 in the last column is obtained by
squaring 47 (47² or 47 × 47). The same is done for each pair of scores in each row. The values of
ΣX, ΣY, ΣXY, ΣX², and ΣY² are obtained by adding up, or summating, the values from student A to
student J. The values are then substituted in the equation for the Pearson r.

r = [10(8095) − (254)(286)] / √{[10(7356) − (254)²][10(8948) − (286)²]}

r = .996
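
The same computation can be scripted for larger data sets. Below is a minimal sketch in Python (the variable names are only illustrative) that reproduces r for the ten pairs of aptitude scores above.

# Minimal sketch: Pearson r for the test-retest data above (illustrative names).
test1 = [45, 30, 20, 15, 26, 20, 35, 26, 10, 27]
test2 = [47, 33, 25, 19, 28, 23, 38, 29, 15, 29]

n = len(test1)
sum_x, sum_y = sum(test1), sum(test2)
sum_xy = sum(x * y for x, y in zip(test1, test2))
sum_x2 = sum(x * x for x in test1)
sum_y2 = sum(y * y for y in test2)

r = (n * sum_xy - sum_x * sum_y) / (
    ((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)) ** 0.5
)
print(round(r, 3))  # prints 0.996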

An obtained r value of .996 can be interpreted in four ways: magnitude, strength,
significance, and variance. In terms of magnitude, the scatterplot shows a regression line
indicating that as the aptitude test scores increase, the retest scores also increase. This
magnitude is said to be positive. A positive magnitude indicates that as the X scores increase,
the Y scores also increase. In cases where a correlation coefficient of -.996 is obtained, this
indicates a negative relationship: as the X scores increase, the Y scores decrease, or vice versa.
For strength, the closer the correlation coefficient is to 1.00 or -1.00, the stronger the
relationship; the closer it is to 0, the weaker the relationship. A strong relationship indicates
that the plots are very close to the projected linear regression line. In the case of the .996
correlation coefficient, it can be said that there is a very strong relationship between the scores on
the aptitude test and the retest scores. The following cut-offs can be used as a guide in determining
the strength of the relationship:

Correlation Coefficient Value     Interpretation

0.80 – 1.00     Very high relationship
0.60 – 0.79     High relationship
0.40 – 0.59     Substantial/marked relationship
0.20 – 0.39     Low relationship
0.00 – 0.19     Negligible relationship

For significance, the test determines whether the odds favor the demonstrated relationship between X
and Y being real as opposed to being due to chance. If the odds favor its being real, then the
relationship is said to be significant. Consult a statistics book for a detailed explanation of testing
the significance of r. To test whether a correlation coefficient of .996 is significant, it is
compared with a critical value of r. The critical values for r are found in Appendix A of this book.
Assuming that the probability of error is set at the alpha level of .05 (meaning that the probability [p] is
less than [<] 5 out of 100 [.05] that the demonstrated relationship is due to chance) (DiLeonardi
& Curtis, 1992), and the degrees of freedom are 8 (df = N − 2 = 10 − 2 = 8), a critical value of .632 is
obtained. The value .632 is the intersecting value in Appendix A for df = 8 and an alpha level of .05.
Significance is attained when the obtained value is greater than the critical value. In this case,
since .996 is greater than .632, there is a significant relationship between the aptitude test
and the retest scores.
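
When a table of critical r values is not at hand, an equivalent check commonly presented in statistics books converts r to a t statistic with df = N − 2. A minimal sketch, using the values from the example above:

# Minimal sketch: significance of r through a t statistic (df = N - 2).
import math

r, n = 0.996, 10
t = r * math.sqrt((n - 2) / (1 - r ** 2))
print(round(t, 2))  # about 31.5, far beyond the two-tailed critical t of about 2.31 for df = 8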
For the variance, it is interpreted as the amount of overlap between X and Y. This is
read as the percentage of the time that the variability in X accounts for, or explains, the
variability in Y. Variance is determined by squaring the correlation coefficient (r²). For the
given data set, the variance would be r² = .996² = .992, or 99.2 percent (.992 × 100). To
interpret this value: 99.2 percent of the time, the scores during the first aptitude test account
for, or explain, the scores during the retest.
Generally, a correlation coefficient of .996 indicates that the aptitude scores on the test
and on the retest are highly reliable or consistent, since the value is very strong and
significant. Software is provided with this book to help you compute test-retest correlation
coefficients and carry out the other techniques for establishing reliability and validity. A detailed
demonstration of using the software is found at the end of this chapter.

Parallel Form or Alternate Form Reliability

In this technique, two tests are used that are equivalent in difficulty, format,
number of items, and the specific skills measured. One of the equivalent forms is administered to the
same examinees on one occasion, and the other form on a different occasion. Parallel-form reliability
is both a measure of temporal stability and of consistency of responses. Since the two tests are
administered separately across time, it is a measure of temporal stability like the test-retest. But on
the second occasion, what is administered is not the exact same test but an equivalent form of the test.
Assuming that the two tests are really measuring the same characteristics, then there should be
consistency in the scores. Parallel forms can be used for affective and cognitive measures in
general, as long as there are available forms of the test.

Reliability is determined by correlating the scores from the first form and the second
form. In most cases, Form A of the test is correlated with Form B. A strong and
significant relationship indicates equivalence and consistency of the two forms.

Split-half Reliability

In split-half reliability, the test is split into two parts, and the scores for each part should show
consistency. The logic behind splitting the test into two parts is to determine whether the scores
within the same test are internally consistent or homogeneous.
There are many ways of splitting the test into two halves. One is by randomly distributing
the items equally into two halves. Another is separating the odd-numbered items from the even-
numbered items. In doing split-half reliability, one ensures that the test contains a large number
of items so that there will still be several items left in each half. The assumption here is that
there should be more items in order for the test to be more reliable; it follows that the more
items in a test, the more reliable it becomes.
Split-half reliability is analyzed by first summating the total scores for each half of the test for each
participant. The paired total scores are then correlated. A high correlation coefficient
indicates internal consistency of the responses in the test. Since only half of the test is correlated
with the other half, a correction formula called the Spearman-Brown formula (rtt) is used to estimate
the reliability of the full-length test. The formula is:
The formula is:

rtt = 2r / (1 + r)

Where

rtt = Spearman-Brown coefficient


r = correlation coefficient

Suppose that a test that measures aggression with 60 items was split into two halves
of 30 items each, and the computed r is .93. The Spearman-Brown coefficient would be
.96. Observe that the correlation coefficient of .93 increased to .96 when converted into the
Spearman-Brown coefficient.

Computation of the Spearman-Brown coefficient from the correlation coefficient:

rtt = 2(.93) / (1 + .93)

rtt = .96
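
The correction is simple enough to verify in a few lines; a minimal sketch in Python for the half-test correlation above:

# Minimal sketch: Spearman-Brown correction for the split-half correlation above.
r_half = 0.93
r_full = (2 * r_half) / (1 + r_half)
print(round(r_full, 2))  # prints 0.96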

Internal Consistency Reliabilities

Several techniques can be used to test whether the responses to the items of a test are
internally consistent. The Kuder-Richardson formula, Cronbach's alpha, interitem correlation, and
item-total correlation can be used.
The Kuder-Richardson formula (KR #20) is used if the responses in the data are binary. Usually it
is used for tests with right or wrong answers, where correct responses are coded as "1" and
incorrect responses are coded as "0." The KR #20 formula is:

KR20 = [k / (k − 1)] [1 − Σpq / σ²]

To determine σ² (variance):

σ² = Σx² / (N − 1)

Where
k = number of items
p = proportion of students with the correct answer to an item
q = proportion of students with the incorrect answer to an item
σ² = variance of the total scores
Σx² = sum of squares of the deviations of the total scores from their mean
Suppose that the following data were obtained on a 10-item math test ("1" = correct answer, "0" =
incorrect answer) among 10 students:
Student  Item1  Item2  Item3  Item4  Item5  Item6  Item7  Item8  Item9  Item10  Total (X)  X − X̄   (X − X̄)²
A        1      1      1      1      1      1      1      1      1      1       10         2.8      7.84
B        1      1      1      1      1      1      1      0      1      1       9          1.8      3.24
C        1      1      1      1      1      1      1      0      0      1       8          0.8      0.64
D        1      1      1      1      1      1      1      1      0      0       8          0.8      0.64
E        1      1      1      1      1      1      1      0      0      0       7          -0.2     0.04
F        1      1      1      1      1      1      0      0      0      1       7          -0.2     0.04
G        1      1      1      1      1      1      0      1      0      0       7          -0.2     0.04
H        1      1      1      0      0      0      1      1      1      0       6          -1.2     1.44
I        1      1      1      1      0      1      0      0      0      0       5          -2.2     4.84
J        1      1      1      1      0      0      0      0      0      1       5          -2.2     4.84
Total    10     10     10     9      7      8      6      4      3      5       X̄ = 7.2             Σx² = 23.6
p        1      1      1      0.9    0.7    0.8    0.6    0.4    0.3    0.5     σ² = 2.62
q        0      0      0      0.1    0.3    0.2    0.4    0.6    0.7    0.5
pq       0      0      0      0.09   0.21   0.16   0.24   0.24   0.21   0.25    Σpq = 1.4

Variance Computation:

Get the total score of each examinee (X), then compute the average of the scores of
the ten examinees (X̄ = 7.2). Subtract the mean from each individual total score (X − X̄), then
square each of these differences (X − X̄)². Get the summation of these squared differences;
this value is Σ(X − X̄)². In the data given, the value of Σ(X − X̄)² is 23.6 and N = 10. Substitute
these values to obtain the variance.

σ² = 23.6 / (10 − 1)

σ² = 2.62

KR20 computation:

The variance is now computed (σ² = 2.62); the next step is to obtain the value of Σpq. This is
obtained by first summating the correct responses for each item (total). This total is converted
into a proportion (p) by dividing by the total number of cases (N = 10). A total of 10, when
divided by 10 (N), gives a proportion of 1. Then, to determine q, which is the proportion
incorrect, subtract the proportion correct from 1. If the proportion correct is 0.9, the proportion
incorrect will be 0.1. Then pq is determined by multiplying p and q. Get the summation of
the pq values; this yields Σpq, which has the value of 1.4. Substitute all the values in the KR 20
formula.

KR20 = [10 / (10 − 1)] [1 − 1.4 / 2.62]

KR20 = 0.52

The internal consistency of the 10-item math test is 0.52, indicating that the responses are not highly
consistent with each other.
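
The same result can be checked with a short script; the sketch below (Python, illustrative names) recomputes KR #20 from the binary responses in the table above.

# Minimal sketch: KR-20 for the 10-item binary math test above (illustrative names).
scores = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # student A
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1],  # B
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 1],  # C
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],  # D
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],  # E
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 1],  # F
    [1, 1, 1, 1, 1, 1, 0, 1, 0, 0],  # G
    [1, 1, 1, 0, 0, 0, 1, 1, 1, 0],  # H
    [1, 1, 1, 1, 0, 1, 0, 0, 0, 0],  # I
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 1],  # J
]
k, n = len(scores[0]), len(scores)
totals = [sum(row) for row in scores]
mean = sum(totals) / n
variance = sum((t - mean) ** 2 for t in totals) / (n - 1)   # 2.62
p = [sum(row[i] for row in scores) / n for i in range(k)]   # proportion correct per item
sum_pq = sum(pi * (1 - pi) for pi in p)                     # 1.4
kr20 = (k / (k - 1)) * (1 - sum_pq / variance)
print(round(kr20, 2))  # prints 0.52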

Cronbach's alpha also determines the internal consistency of the responses to items
in the same test. Cronbach's alpha can be used for responses that are not limited to the binary
type, such as a five-point scale and other response formats that are expressed numerically. Usually,
tests beyond the binary type are affective measures and inventories where there are no right or
wrong answers.
Suppose that a five-item test measuring attitude towards school assignments was
administered to five high school students. Each item in the questionnaire is answered using a 5-point
Likert scale (5 = strongly agree, 4 = agree, 3 = not sure, 2 = disagree, 1 = strongly disagree).
Below are the five items that measure attitude towards school assignments. Each student
selects, on a Likert scale of 1 to 5, how they respond to each of the items. Their responses
are then encoded.

Each item is rated: 5 = strongly agree, 4 = agree, 3 = not sure, 2 = disagree, 1 = strongly disagree.

1. I enjoy doing my assignments.
2. I believe that assignments help me learn the lesson better.
3. I know that assignments are meant to enhance our skills in school.
4. I make it a point to check my notes and books everyday to see if I have assignments.
5. I make sure that I complete all my assignments everyday.

The next table shows how Cronbach's alpha is determined given the responses of the
five students. In the table, student A answered '5' for item 1, '5' for item 2, '4' for item 3,
'4' for item 4, and '1' for item 5. The same goes for students B, C, D, and E.
In computing Cronbach's alpha, the variance (σ²) of the students' total scores and the
variance of the total scores for each item are used instead of Σpq. Obtaining the variance
of the respondents' total scores is the same as in the Kuder-Richardson: the mean of the scores
is subtracted from each score, each difference is squared, and the sum of squares (22.8) is divided by
n − 1 (5 − 1 = 4). Dividing the sum of squares (22.8) by n − 1 (4) gives the variance (σt² = 5.7).
The same procedure is done for obtaining the item variance Σ(σt²). Get the sum of all scores
per item (summate going down each column in the table below), then obtain the mean of the
scores per item (X̄item = 16.2). The mean is subtracted from each item total (Score − Mean). This
difference is then squared (Score − Mean)². The sum of squares is then obtained, Σ(Score −
Mean)². The value (38.8) is divided by n − 1 and gives the value of Σ(σt²), which is the variance
of the items, Σ(σt²) = 9.7. The values obtained can now be substituted in the formula for
Cronbach's alpha:

Cronbach's α = [n / (n − 1)] [(σt² − Σ(σt²)) / σt²]

Cronbach's α = [5 / (5 − 1)] [(5.7 − 9.7) / 5.7]

Cronbach's α = .88

The table below shows the values obtained in the procedure.


Student              item1   item2   item3   item4   item5   Total (X)   Score − Mean   (Score − Mean)²
A                    5       5       4       4       1       19          2.8            7.84
B                    3       4       3       3       2       15          -1.2           1.44
C                    2       5       3       3       3       16          -0.2           0.04
D                    1       4       2       3       3       13          -3.2           10.24
E                    3       3       4       4       4       18          1.8            3.24
                                                             X̄case = 16.2               Σ(Score − Mean)² = 22.8
Total for each item  14      21      16      17      13      X̄item = 16.2
Score − Mean         -2.2    4.8     -0.2    0.8     -3.2
(Score − Mean)²      4.84    23.04   0.04    0.64    10.24   Σ(Score − Mean)² = 38.8

σt² = Σ(Score − Mean)² / (n − 1) = 22.8 / (5 − 1) = 5.7

Σ(σt²) = Σ(Score − Mean)² / (n − 1) = 38.8 / (5 − 1) = 9.7

The internal consistency of the responses to the attitude towards school assignments items is .88, indicating high internal consistency.
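In practice, coefficient alpha is computed by software rather than by hand. Below is a minimal sketch of the commonly used computing form, in which the item term is the sum of the variances of each item; the function name and data layout are only illustrative, and on very small hand-worked data sets the result can differ from shortcut hand computations.

# Minimal sketch: coefficient alpha using the sum of per-item variances (illustrative).
def cronbach_alpha(rows):
    """rows: one list of item scores per respondent."""
    n_items = len(rows[0])

    def variance(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / (len(values) - 1)

    item_vars = [variance([row[i] for row in rows]) for i in range(n_items)]
    total_var = variance([sum(row) for row in rows])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

For example, cronbach_alpha(responses) could be called on a list of respondents' Likert-scale answers exported from a spreadsheet.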

Internal consistency can also be determined by correlating each combination of items in a test,
which is known as interitem correlation. The responses to the items are internally consistent
if they yield high correlation coefficients.
To demonstrate the interitem correlation among the responses of the five students on their
attitude towards assignments, each set of item scores is correlated with each other using the
Pearson r. This means that item 1 is correlated with item 2, item 3, item 4, and item 5; then
item 2 is correlated with item 3, item 4, and item 5; then item 3 is correlated with item 4 and item
5; then item 4 is correlated with item 5. These combinations produce a correlation matrix:

Item 1 Item 2 Item 3 Item 4 Item 5


Item 1 1.00 0.24 0.85 0.74 -0.65

Item 2 0.24 1.00 -0.07 -0.22 -0.68

Item 3 0.85 -0.07 1.00 0.87 -0.16

Item 4 0.74 -0.22 0.87 1.00 -0.08

Item 5 -0.65 -0.68 -0.16 -0.08 1.00

Notice that a perfect correlation coefficient is obtained when an item is correlated with
itself (1.00). It can also be noted that strong correlation coefficients were obtained between
items 1 and 3 and between items 1 and 4, indicating internal consistency. Some pairs had negative
correlations, such as items 1 and 5, and items 2 and 5.
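
A correlation matrix like the one above is normally produced by software; the sketch below (Python, illustrative names) builds it from the five students' item scores.

# Minimal sketch: interitem (Pearson) correlation matrix for the attitude items above.
items = [
    [5, 3, 2, 1, 3],  # item 1 scores of students A to E
    [5, 4, 5, 4, 3],  # item 2
    [4, 3, 3, 2, 4],  # item 3
    [4, 3, 3, 3, 4],  # item 4
    [1, 2, 3, 3, 4],  # item 5
]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

matrix = [[round(pearson(a, b), 2) for b in items] for a in items]
for row in matrix:
    print(row)  # diagonal entries are 1.0; item 1 with item 2 is about 0.24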

Interrater Reliability

When rating scales are used by judges, the responses can also be tested for
consistency. The concordance, or consistency, of the ratings is estimated by computing
Kendall's coefficient of concordance (W).
Suppose that the following thesis presentation ratings were obtained from three judges for five
groups who presented their theses. The ratings are on a scale of 1 to 4, where 4 is the highest
and 1 is the lowest.

Thesis presentation Rater 1 Rater 2 Rater 3 Sum of Ratings D D2


1 4 4 3 11 2.6 6.76
2 3 2 3 8 -0.4 0.16
3 3 4 4 11 2.6 6.76
4 3 3 2 8 -0.4 0.16
5 1 1 2 4 -4.4 19.36
X Ratings =8.4 ΣD2=33.2

The concordance among the three raters using Kendall's W is computed by summating
the ratings for each case (thesis presentation). The mean of the sums of ratings is obtained
(X̄Ratings = 8.4). The mean is then subtracted from each sum of ratings (D). Each difference is
squared (D²), and the sum of squares is computed (ΣD² = 33.2). These values can now be
substituted in the Kendall's W formula, where m is the number of raters and N is the number of
cases rated.

W = 12ΣD² / [m²N(N² − 1)]

W = 12(33.2) / [3²(5)(5² − 1)]

W = 0.37

A Kendall's W coefficient of .37 estimates the agreement of the three raters on the five thesis
presentations. Given this value, there is only moderate concordance among the three raters, because
the value is not very high.
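
The computation is short enough to script; a minimal sketch in Python (illustrative names) for the three raters above:

# Minimal sketch: Kendall's W for the thesis presentation ratings above.
ratings = [
    [4, 4, 3],  # presentation 1: rater 1, rater 2, rater 3
    [3, 2, 3],
    [3, 4, 4],
    [3, 3, 2],
    [1, 1, 2],
]
m = len(ratings[0])                          # number of raters
n = len(ratings)                             # number of presentations rated
sums = [sum(row) for row in ratings]
mean = sum(sums) / n
sum_d2 = sum((s - mean) ** 2 for s in sums)  # 33.2
w = 12 * sum_d2 / (m ** 2 * n * (n ** 2 - 1))
print(round(w, 2))  # prints 0.37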

Summary on Reliability

Test-retest. Nature: repeating the identical test on a second occasion. Measure of: temporal stability. Use: when variables are stable (e.g., motor coordination, finger dexterity, aptitude, capacity to learn). Statistical procedure: correlate the scores from the first test and the second test; the higher the correlation, the more reliable.

Alternate Form/Parallel Form. Nature: the same person is tested with one form on the first occasion and with another, equivalent form on the second. Measure of: equivalence; temporal stability and consistency of response. Use: personality and mental ability tests. Statistical procedure: correlate scores on the first form with scores on the second form.

Split-half. Nature: two scores are obtained for each person by dividing the test into equivalent halves. Measure of: internal consistency; homogeneity of items. Use: personality and mental ability tests; the test should have many items. Statistical procedure: correlate scores on the odd- and even-numbered items; convert the obtained correlation coefficient into a reliability estimate using the Spearman-Brown formula.

Kuder-Richardson Reliability. Nature: computed for binary (e.g., true/false) items. Measure of: consistency of responses to all items. Use: when there is a correct answer (right or wrong). Statistical procedure: use the KR #20 or KR #21 formula.

Coefficient Alpha. Nature: used to estimate the internal consistency of items. Measure of: consistency of responses to all items; homogeneity of items. Use: tests with multiple scored items, such as personality measures. Statistical procedure: use the Cronbach's alpha formula.

Inter-item reliability. Nature: correlation of all item combinations. Measure of: consistency of responses to all items; homogeneity of items. Use: tests with multiple scored items, such as personality measures. Statistical procedure: each item is correlated with every other item in the test.

Scorer Reliability. Nature: a sample of cases is independently scored by two raters. Measure of: decreasing examiner or scorer variance. Use: performance assessments; clinical instruments employed in intensive individual testing (e.g., projective tests). Statistical procedure: the scores from the two raters are correlated with each other; Kendall's W is used to estimate the concordance of raters.

Activity 1:

Test whether the typing test is reliable. The following are the scores of 15 participants on a typing
test; use test-retest reliability.

First Test Retest


47 30
45 44
43 40
24 28
35 40
45 46
46 46
34 37
34 35
36 35
43 40
21 25
22 23
23 24
24 20

Activity 2

Administer the "Academic Self-regulation Scale" to at least 30 students, then obtain its internal
consistency using split-half, Cronbach's alpha, and interitem correlation.

Self-regulation Scale
Instruction: The following items assess your learning and study strategy use. Read each item carefully and
RESPOND USING THE SCALE PROVIDED. Encircle the number that corresponds to your answer.
4: Always 3: Often 2: Rarely 1: Never

Before answering the items, please recall some typical situations of studying which you have experienced. Kindly
encircle the number showing how you practice the following items.

Always Often Rarely Never


MS 1. I make and use flashcards for short answer questions or concepts. 4 3 2 1
MS 2. I make lists of related information by categories 4 3 2 1
MS 3. I rewrite class notes by rearranging the information in my own words. 4 3 2 1
MS 4. I use graphic organizers to put abstract information into a concrete form. 4 3 2 1
MS 5. I represent concepts with symbols such as drawings so I can easily remember 4 3 2 1
them.
MS 6. I make a summary of my readings. 4 3 2 1
MS 7. I make outlines as a guide while I am studying. 4 3 2 1
MS 8. I summarize every topic we had in class. 4 3 2 1
MS 9. I visualize words in my mind to recall terms. 4 3 2 1
MS 10. I recite the answers to questions on the topic that I made up. 4 3 2 1
MS 11. I record into a tape the lessons/notes. 4 3 2 1
MS 12. I make sample questions from a topic and answer it. 4 3 2 1
MS 13. I recite my notes while studying for an exam. 4 3 2 1
MS 14. I use post-its to remind me of my homework. 4 3 2 1
MS 15. I make a detailed schedule of my daily activities. 4 3 2 1
GS 16. I make a timetable of all the activities I have to complete. 4 3 2 1
GS 17. I plan the things I have to do in a week. 4 3 2 1
GS 18. I use a planner to keep track on what I am supposed to accomplish. 4 3 2 1
GS 19. I keep track of everything I have to do in a notebook or on a calendar. 4 3 2 1
SE 20. If I am having a difficulty I inquire assistance from an expert. 4 3 2 1
SE 21. I like peer evaluations for every output 4 3 2 1
SE 22. I evaluate my accomplishments at the end of each study session. 4 3 2 1
SE 23. I ask others how my work is before passing it to my professors. 4 3 2 1
SE 24. I take note of my improvements on what I do. 4 3 2 1
SE 25. I monitor my improvement in doing certain task. 4 3 2 1
SE 26. I ask feedback of my performance from someone who is more capable. 4 3 2 1
SE 27. I listen attentively to people who comment on my work. 4 3 2 1
SE 28. I am open to feedbacks to improve my work. 4 3 2 1
SE 29. I browse through my past outputs to see my progress. 4 3 2 1
SE 30. I ask others what changes should be done with my homework, papers, etc. 4 3 2 1
SE 31. I am open to changes based from the feedbacks I received. 4 3 2 1
SA32. I use internet in making my research papers. 4 3 2 1
SA 33. I surf the net to find the information that I need. 4 3 2 1
SA 34. I take my own notes in class. 4 3 2 1
SA 35. I enjoy group works because we help one another. 4 3 2 1
SA 36. I call or text a classmate about the home works that I missed. 4 3 2 1
SA 37. I look for a friend whom I can have an exchange of questions 4 3 2 1
SA 38. I study with a partner to compare notes. 4 3 2 1
SA 39. I explain to my peers what I have learned. 4 3 2 1
ES 40. I avoid watching the television if I have pending homework. 4 3 2 1
ES 41. I isolate myself from unnecessary noisy places. 4 3 2 1
ES 42. I don’t want to hear a single sound while I’m studying. 4 3 2 1

ES 43. I can’t study nor do my homework if the room is dark. 4 3 2 1


ES 44. I switch off my TV for me to concentrate on my studies. 4 3 2 1
RS 45. I recheck my homework if I have done it correctly before passing. 4 3 2 1
RS 46. I do things as soon as the teacher gives the task. 4 3 2 1
RS 47. I am concerned with the deadlines set by the teachers. 4 3 2 1
RS 48. I picture in my mind how the test will look like based on previous tests 4 3 2 1
RS 49. I finish all my homework first before doing unnecessary things. 4 3 2 1
OR50. I highlight important concepts and information I found in my readings. 4 3 2 1
OR 51. I make use of highlighters to highlight the important concepts in my reading. 4 3 2 1
OR 52. I put my past notebooks, handouts, and the like in a certain shelf. 4 3 2 1
OR 53. I study at my own pace. 4 3 2 1
OR 54. I fix my things first before I start to study. 4 3 2 1
OR 55. I make sure my study area is clean before studying. 4 3 2 1

MS: memory strategy


GS: Goal Setting
SE: Self-evaluation
SA: Seeking assistance
ES: Environmental Structuring
RS: Responsibility
OR: Organizing

Further Analysis

1. Show the Cronbach’s alpha for each factor and indicate whether the responses are
internally consistent.
2. Split the test into two then indicate whether the responses are internally consistent.
3. Intercorrelate each item.

Lesson 2
Validity

Validity indicates whether an assessment tool measures what it intends to measure.

Validity estimates indicate whether the latent variable shared by the items in a test is in fact the
target variable of the test developer. Validity refers to the ability of a scale or test to predict events,
to relate to other measures, and to have a representative item content.

Content Validity

Content validity is the systematic examination of the test content to determine whether it
covers a representative sample of the behavior domain to be measured. For affective measures, it
concerns whether the items are enough to manifest the behavior measured. For cognitive tests, it
concerns whether the items cover all contents specified in an instruction.
Content validity is more appropriate for cognitive tests like achievement tests and teacher
made tests. In these types of tests, there is a presence of a specified domain that will be included
in the test. The content covered is found in the instructional objectives in the lesson plan,
syllabus, table of specifications, and textbooks.
Content validity is established through consultation with experts. In the process, the
objectives of instruction, the table of specifications, and the items of the test are shown to the
consulting experts. The experts check whether the items are enough to cover the content of the
instruction provided, whether the items measure the objectives set, and whether the items are
appropriate for the cognitive skill intended. The process also involves checking whether the items
are appropriately phrased for the level of students who will take the test and whether the items are
relevant to the subject area tested.
Details on constructing Table of Specifications are explained in the next chapters.

Criterion-Prediction Validity

Criterion-prediction validity involves prediction from the test to a criterion situation over a time
interval. For example, to assess the predictive validity of an entrance exam, it is correlated
later with the students' grades after a trimester or semester. The criterion in this case is the
students' grades, which will come in the future.
Criterion prediction is used in hiring job applicants, selecting students for admission to
college, and assigning military personnel to occupational training programs. For selecting job
applicants, pre-employment tests are correlated with supervisor ratings obtained in the
future. In assigning military personnel to training, the aptitude test administered before training
is correlated with the post-assessment in the training. Positive and high correlation
coefficients should be obtained in these cases to adequately say that the test has predictive
validity.
Generally, the analysis involves correlating the test scores with other criterion measures,
for example, mechanical aptitude scores with job performance as a machinist.

Construct Validity

Construct validity is the extent to which the test may be said to measure a theoretical
construct or trait. This is usually conducted for measures that are multidimensional or contain
several factors. The goal of construct validity is to explain and prove the factors of the measure
as posited in the theory used.
There are several methods for analyzing the constructs of a measure. One way is to
correlate a new test with a similar earlier test that measures approximately the same general
behavior. For example, a newly constructed measure of temperament is correlated with an
existing measure of temperament. If high correlations are obtained between the two measures, it
means that the two tests are measuring the same constructs or traits.
Another widely used technique to study the factor structure of a test is factor analysis,
which can be exploratory or confirmatory. Factor analysis is a mathematical technique that
identifies sources of variation among the constructs involved. These sources of variation are
usually called factors or components (as explained in Chapter 1). Factor analysis reduces the
number of variables, and it detects the structure in the relationships between variables, or classifies
variables. A factor is a set of highly intercorrelated variables. In using Principal Components
Analysis as a method of factor analysis, the process involves extracting the possible groups that
can be formed through the eigenvalues, a measure of how much variance each successive factor
extracts. The first factor is generally more highly correlated with the variables than the second
factor. This is to be expected because the factors are extracted successively and account
for less and less of the overall variance. Factor extraction stops when factors begin to yield low
eigenvalues. An example of extraction showing the eigenvalues is illustrated below from the study
by Magno (2008), who developed a scale measuring parental closeness with 49 items and
four hypothesized factors (bonding, support, communication, interaction).

Plot of Eigenvalues (scree plot): eigenvalue (y-axis) plotted against the number of eigenvalues (x-axis)

The scree plot shows that 13 factors can be used to classify the 49 items. The number of
factors is determined by counting the eigenvalues that are greater than 1.00. But having 13
factors is not good because it does not further reduce the variables. One technique in the scree
test is to find the place where the smooth decrease of eigenvalues appears to level off to the
right of the plot. To the right of this point, presumably, one finds only "factorial scree"; "scree"
is the geological term for the debris that collects on the lower part of a rocky slope. In
applying this technique, the fourth eigenvalue shows where the smooth decrease in the graph begins.
Therefore, four factors can be considered for the test.
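
The eigenvalues behind a scree plot can be obtained with widely available numerical libraries. The sketch below (Python with NumPy; the array name is only illustrative) computes the eigenvalues of the item intercorrelation matrix, which can then be plotted or screened with the greater-than-1.00 rule.

# Minimal sketch: eigenvalues of the item correlation matrix for a scree plot.
import numpy as np

def scree_eigenvalues(data):
    """data: a respondents-by-items array of scores."""
    corr = np.corrcoef(data, rowvar=False)          # item intercorrelation matrix
    return np.sort(np.linalg.eigvalsh(corr))[::-1]  # largest eigenvalue first

# Kaiser criterion: count how many eigenvalues exceed 1.00
# n_factors = int((scree_eigenvalues(data) > 1.0).sum())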
The items that will belong under each factor are determined by assessing the factor
loadings of each item. In the process, each item loads on each factor extracted. An item that
loads highly on a factor technically belongs to that factor because it is highly correlated with
the other items in that factor or group. A factor loading of .30 means that the item contributes
meaningfully to the factor; a factor loading of .40 means the item contributes highly to the
factor. An example of a table with factor loadings is illustrated below.

Item      Factor 1   Factor 2   Factor 3   Factor 4
item1 0.032 0.196 0.172 0.696
item2 0.13 0.094 0.315 0.375
item3 0.129 0.789 0.175 0.068
item4 0.373 0.352 0.35 0.042
item5 0.621 -0.042 0.251 0.249
item6 0.216 -0.059 0.067 0.782
item7 0.093 0.288 0.307 0.477
item8 0.111 0.764 0.113 0.085
item9 0.228 0.315 0.144 0.321
item10 0.543 0.113 0.306 -0.01

In the table above, items that load highly on a factor should have a loading of .40 and
above. For example, item 1 loaded highly on factor 4, with a factor loading of .696, as compared
with its other loadings of .032, .196, and .172 for factors 1, 2, and 3, respectively. This means that
item 1 will be classified under factor 4, together with item 6 and item 7, because they all load
highly on the fourth factor. Factor loadings are best assessed when the factors are rotated
(consult scaling theory references for details on factor rotation).

Another way of proving the factor structure of a construct is through Confirmatory Factor
Analysis (CFA). In this technique, there is a developed and specific hypothesis about the
factorial structure of a battery of attributes. The hypothesis concerns the number of common
factors, their pattern of intercorrelation, and the pattern of common factor weights. It is used to
indicate how well a set of data fits the hypothesized structure. The CFA is done as a follow-up to a
standard factor analysis. In the analysis, the parameters of the model are estimated, and the
goodness of fit of the solution to the data is evaluated. For example, the study of Magno
(2008) confirmed the factor structure of parental closeness (bonding, support, communication,
succorance) after a series of principal components analyses. The parameter estimates and the
goodness of fit of the measurement model were then analyzed.

Figure 1
Measurement Model of Parental Closeness using Confirmatory Factor Analysis

The model estimates in the CFA show that all the factors of parental closeness have
significant parameters (8.69*, 5.08*, 5.04*, 1.04*). The delta errors are used (28.83*, 18.02*,
18.08*, 2.58*), and each factor has a significant estimate as well. Having a good fit reflects
having all factor structures significant for the construct parental closeness. The goodness of fit
using the chi-square indicates a rather good fit (χ² = 50.11, df = 2). The goodness of fit based on the
root mean square standardized residual (RMS = 0.072) shows that there is little error, the value
being close to .01. Using noncentrality fit indices, the values show that the four-factor solution has a
good fit for parental closeness (McDonald Noncentrality Index = 0.910, Population Gamma
Index = 0.914).
Confirmatory Factor Analysis can also be used to assess the best factor structure of a
construct. For example, the study of Magno, Tangco, and Sy (2007) assessed the factor
structure of metacognition (awareness of one's learning) and its effect on critical thinking
(measured by the Watson-Glaser Critical Thinking Appraisal). Two factor structures of
metacognition were assessed. The first model of metacognition includes two factors,
regulation of cognition and knowledge of cognition (see Schraw and Dennison). The second
model tested metacognition with eight factors: declarative knowledge, procedural knowledge,
conditional knowledge, planning, information management, monitoring, debugging strategy, and
evaluation of learning.

Model 1. Two Factors of Metacognition

Model 2: Eight Factors of Metacognition



The results of the analysis using CFA showed that model 1 has a better fit compared to
model 2. This indicates that metacognition is better viewed with two factors (knowledge of
cognition and regulation of cognition) than with eight factors.
Principal Components Analysis and Confirmatory Factor Analysis can be conducted
using available statistical software packages such as Statistica and SPSS.

Convergent and Divergent Validity

According to Anastasi and Urbina (2002), the method of convergent and divergent
validity is used to show that a measure correlates with variables with which it should theoretically
correlate (convergent) and does not correlate with variables from which it should differ (divergent).
In convergent validity, the intercorrelations among constructs that are theoretically related should
be high and positive. For example, in the study of Magno (2008) on parental closeness, when
the factors of parental closeness were intercorrelated (bonding, support, communication, and
succorance), positive magnitudes were obtained, indicating convergence of these constructs.

Factors of Parental Closeness (1) (2) (3) (4)


(1) Bonding 1.00 0.70** 0.62** 0.44**
(2) Communication 1.00 0.57** 0.28**
(3) Support 1.00 0.59**
(4) Succorance 1.00
**p<.05

For divergent validity, a construct should correlate inversely with its opposite factors. For
example, the study by Magno, Lynn, Lee, and Kho (in press) constructed a scale that measures
mothers' involvement with their grade school and high school children. The factors of mothers'
involvement in school-related activities were intercorrelated. Observe that these factors belong to
the same test, but controlling was negatively related to permissive, and loving was negatively related
to autonomy. This indicates divergence of the factors within the same measure.

Factors of Mother's Involvement   Controlling   Permissive   Loving   Autonomy

Controlling                       ---
Permissive                        -0.05         ---
Loving                            0.05          0.17*        ---
Autonomy                          0.14*         0.41*        -0.36*   ---

Summary on Validity

Content Validity. Nature: systematic examination of the test content to determine whether it covers a representative sample of the behavior domain to be measured. Use: more appropriate for achievement tests and teacher-made tests. Procedure: items are based on instructional objectives, course syllabi, and textbooks; consultation with experts; making test specifications.

Criterion-Prediction Validity. Nature: prediction from the test to a criterion situation over a time interval. Use: hiring job applicants, selecting students for admission to college, assigning military personnel to occupational training programs. Procedure: test scores are correlated with other criterion measures, e.g., mechanical aptitude and job performance as a machinist.

Construct Validity. Nature: the extent to which the test may be said to measure a theoretical construct or trait. Use: personality tests and measures that are multidimensional. Procedure: correlate a new test with a similar earlier test that measures approximately the same general behavior; factor analysis; comparison of the upper and lower groups; point-biserial correlation (pass and fail with total test score); correlate each subtest with the entire test.

Convergent Validity. Nature: the test should correlate significantly with variables it is related to. Use: commonly for personality measures. Procedure: multitrait-multimethod matrix.

Divergent Validity. Nature: the test should not correlate significantly with variables from which it should differ. Use: commonly for personality measures. Procedure: multitrait-multimethod matrix.

EMPIRICAL REPORT

The Development of the Self-disclosure Scale

Carlo Magno
Sherwin Cuason
Christine Figueroa
De La Salle University-Manila

Abstract
The purpose of the present study is to develop a measure for self-disclosure. The items were based on a survey administered to 83 college students. From the survey, 114 items were constructed under 9 hypothesized factors. The items were reviewed by experts. The main try-out form of the test was composed of 112 items administered to 100 high school and college students. The data analysis showed that the test has a Cronbach's alpha of .91. The factor loadings retained 60 items with high summated correlations under five factors. The new factors are beliefs, relationships, personal matters, interests, and intimate feelings.

Each person has a complex personality system. Individuals are oftentimes very much interested in knowing our personality type, attitudes, interests, aptitude, achievement, and intelligence. This is the reason why we should develop a psychological test that would help us assess our standing. The test we have developed aims to measure the self-disclosing frequency of individuals in different areas. This will help them know what areas in their life they are willing to let other people know. This would be a good instrument for counselors to use for the assessment of their clients. The result of the client's test would help the counselor adjust his or her skills in eliciting or disclosing more or other areas or other topics.
Self-disclosure is a very important aspect in the counseling process, because self-disclosure is one of the instruments the counselor can use. The consequence of the client not disclosing himself is the inability to respond to his problem and to the counselor. This is what the researchers took into consideration in developing the test. It could also be used outside the counseling process. An individual may want to take it to find out what areas in his or her life have been easy for them to shell out and what areas need more revelations.
It has always been psychologists' concern to explain what is going on inside a particular individual in relation with his entire system of personality. One important component of looking into the intrinsic phenomenon of human behavior is self-disclosure. Self-disclosure, as defined by Sidney Jourard (1958), is the process of making the self known to other persons; "target persons" are persons to whom information about the self is communicated. In the process of self-disclosure we make ourselves manifest in thinking and feeling through our actions - actions expressed verbally (Chelune, Skiffington, & Williams, 1981). In addition, Hartley (1993) stressed the importance of interpersonal communication in disclosing the self. Hartley (1993) defined self-disclosure as the means of opening up about oneself with other people. Moreover, Norrel (1989) defined self-disclosure as the process by which persons make themselves known to each other, which occurs when an individual communicates genuine thoughts and feelings.
Generally, self-disclosure is the process in which a person is willing to share or open oneself to another person or group whom the individual can trust, and the process is done verbally. The factors identified in self-disclosure, which are potent areas in the content in communicating superficial or intimate topics, are (1) personal matters, (2) thoughts and ideas, (3) religion, (4) work, study, and accomplishments, (5) sex, (6) interpersonal relationship, (7) emotional state, (8) tastes, and (9) problems.
The process of self-disclosure occurs during interaction with others (Chelune, Skiffington, & Williams, 1981). In the studies that Jourard (1961; 1969) conducted, he stated that a person will permit himself to be known when "he believes his audience is a man of goodwill." There should be a guarantee of privacy that the

information disclosed will not escape the circle.


Jourard (1971) noted that persons need
to self-disclose to get in touch with their real Areas of Self-disclosure
selves, to have intimate relationships with
people, to bond with others, in pursuit of the truth In terms of the information disclosed, the
of one’s being and to direct their destiny on the researchers arrived with nine hypothesized
basis of knowledge. Jourard agrees with Buber factors based on a survey study conducted.
(1965) that in a humanistic sense of self- These factors are: Interpersonal relationship,
disclosure “we see the index of man functioning thoughts and ideas,
at his highest and truly human level rather than at work/study/accomplishments, sex, religion,
the level of a thing or an animal. “ personal characteristics, emotional state, tastes,
The consequences that follow after self- problems. The factors are reflected on the
disclosure are manifested on its outcomes subjects disposition of being students in which
(Jourard, 1971). The outcomes are: there are influences of social situation of
(1) We learn the extent to which we are schooling and social life.
similar, one to the other, and to the extent to
which we differ from one another in thoughts, Interpersonal Relationship. Interpersonal
feelings, hopes and reactions to the past. relationship is operationally defined as the range
(2) We learn of the other man’s needs, of relationships or bonding formed within the
enabling them to help him or to ensure that his outside the family which include peers, friends,
needs will not be met. and casual acquaintances. Jourard (1971)
(3) We learn the extent to which a man proposed that disclosure of relatively intimate
accords with or deviates from moral and ethical information indicates movement towards greater
standards. intimacy in interpersonal relationships. In
In a survey that the researchers have support, it is indicated that self-disclosure
conducted, a person after disclosing feels better illuminate the process of developing relationships
(42.2%), happy (8.26%), free (5.51%), fine (Hill & Stull, 1981; Altman & Taylor, 1973).
(4.6%), relaxed (3.67%), peaceful (3.67%), okay In terms of gender, it was consistently
(3.67%), lighter (2.75%), calm (2.75%), great proven that women disclose themselves to their
(1.83%), satisfied (1.83%), nothing (6.42%), and same gender to the greater extent that men do.
others (12.88%). Furthermore, it was reported Females have generally been reported to be
that on being transparent or open, individuals feel more disclosing than males (Jourard, 1971;
relieved that a burden was taken off their Chelume et al, 1981; Taylor et al, 1981). Some
shoulders, they experience peace of mind, and studies indicate that individuals who are more
consequently happiness, contact with his or her willing to disclose personal information about
real self, and better able to direct their destiny on themselves also to high-disclosing rather than
the basis of knowledge (Jourard, 1971; low disclosing others (Jourard, 1959; Jourard &
Maningas, 1993). Landsman, 1960; Richman, 1963; Altman &
Cozby (1973) noted that self-disclosure Taylor, 1973).
as an ongoing behavioral process include five It was reported that self-disclosure is
basic parameters: amount of personal significantly and positively related with friendship
information disclosed; intimacy of the information and this relationship is greatest with respect to
disclosed; rate or duration of disclosure; affective intimate topics or superficial information (Rubin &
manner of presentation; and disclosing flexibility, Levy, 1975; Newcomb, 1961; Priest & Sawyer,
these are the appropriate cross-situational 1967). Rubin and Shenker (1975) adapted a
modulation of disclosure. Cozby (1973) further self-disclosure questionnaire of Jourard and
stated that interrelatedness on these parameters is used interchangeably.

new clusters; interpersonal relationship, attitudes, person and to know another person - sexually
sex, and tastes. These clusters contain items on and cognitively - will find the prospective
sensitive information one withholds. The self- terrifying.
disclosure reports are less moderately reliable Sex as a factor in self-disclosure is
(.62 to .72 for men and .51 to .78 for women). included because most closely knitted
In marital relationship, it was found that adolescents gives focal view on sex. The survey
the greater the discrepancy in partners affective study that was conducted shows that 5.26% of
self-disclosure and marital satisfaction ( Levinger males and 3.44% of females disclose themselves
& Senn, 1967; Jorgensen, 1980). regarding sexual matters.
In parent-child relationship it was
reported that there are no differences in the Personal matters about the self.
content of the self-disclosure of Filipino Personal matters consist of private truths about
adolescents with their mother and father (Cruz, oneself and it may be favorable or unfavorable
Custodio, & Del Fierro, 1996). The study also evaluative reaction toward something or
indicated that birth order is highly relevant in someone, exhibited in one’s belief, feelings or
analyzing the content of self-disclosure. The intended behavior.
result of the study also show that children are In an experiment conducted by Taylor,
more disclosing toward the mother because she Gould, and Brounstein (1981), they found that
empathize. the level of intimacy of the disclosure was
determined by (1) dispositional characteristics,
Sex. One of the most intimate topics as (2) characteristics of subjects, and (3) the
a content in self-disclosure is sex. It is usually situation. Their personalistic hypothesis was
embarrassing and hard to open to others because confirmed that the level of disclosure affects the
some people have the faulty learning that it is level of intimacy. Some studies also show that
evil, lustful, and dirty (Coleman, Butcher, & some individuals are more willing to disclose
Carson, 1980). But mature individuals view personal information about themselves to high
human sexuality as a way of being in the world of disclosing rather than low disclosing others
men and women whose moments of life and (Jourard, 1959; Jourard & Landsman, 1960;
every aspect of living is spent to experience Jourard & Richman, 1963; Altman & Taylor,
being with the entire world in a distinctly male or 1973). Furthermore, Jones & Archer (1976) have
female way (Maningas, 1995). Furthermore, sought directly that the recipient’s attraction
sexuality is part of our natural power or capacity towards a discloser would be mediated by the
to relate to others. It gives the necessary personalistic attribution the recipient makes for
qualities of sensitivity, warmth, mental respect in the disclosers level of intimacy.
our interpersonal relationship and openness Kelly and McKillop (1996) in their article
(Maningas, 1995). stated that “choosing to reveal personal secrets
Sexuality as being part of our is a complex decision that could have distorting
relationship needs to be opened up or expressed consequences, such as being rejected and
as Freud noted the desire of our instinct or id. alienated from the listener.” But Jourard (1971)
Maningas (1995) stressed out that sex is an noted that a healthy behavior feels “right” and it
integral part of our personal self-expression and should produce growth and integrity. Thus,
our mission of self-communication to others. disclosing personal matters about oneself is a
Some findings by Jourard (1964) on subject means of being honest and seeking others to
matter differences noted that details about one’s understand you better.
sex life is not muchly disclosable as compared
with other factors. Jourard (1964) also noted that Emotional State. One of the factors of
anyone who is reluctant to be known by another self-disclosure defined as one’s revelation of

emotions or feelings to other people. A psychological rationale for the selected use of
retrospective study was conducted to determine therapist self-disclosure, the conscious sharing of
what students did to make their developing thoughts, feelings, attitudes, or experiences with
romantic relationship known to social network a patient (Goldstein, 1994).
members and what they did to keep their
relationship from becoming known. It is shown in Religion. We operationally defined
this study that the most frequent reasons for religion in self-disclosure as the ability of an
revelation were felt obligation to reveal based on individual to share his experiences thoughts, and
the relationship with the target, the desire for emotions toward his beliefs about God. Healey
emotional expression, and the desire for (1990) offer an overview of the role of self-
psychological support from the target. The most disclosure in Judeo-Christian religious
frequent reason to withhold information was the experience with emphasis in the process of
anticipation of a negative reaction from the target spiritual direction. In the study done by Kroger
(Baxter, 1993). The researchers felt that the (1994), he shows the catholic confession as the
determination of the probability of self-disclosure embodiment of common sense regarding the
will be a lot better if emotional state is considered social management of personal secrets, of the
as a factor. Emotions, disclosures & health sins committed, and considers confession as a
addresses some of the basic issues of model for understanding the problem of the
psychology and psychotherapy: how people social transmission of personal secrets in
respond to emotional upheavals, why they everyday life. It is very important and considered
respond the way they do, and why translating as a factor in self-disclosure because of the fact,
emotional events into language increases the Filipino people are very religious, and study
physical and mental health (Pennebaker, 1995). shows that religious people disclose more (
Kroger, 1994).
Taste. Is defined as the likes and Problem
dislikes of a person opened to other people. In When a person is depressed, he tends to
a study made by Rubin & Shenker (1975), they find others that will listen and can share the
made a test studying the friendship, proximity problem with. To release the tension a person
and self-disclosure of college students in the feels, he usually discloses it. Clarity of a problem
contexts of being roommates or hallmates. The is attained when people start to verbalize it and
items were categorized in four clusters, in what in the process, a solution can be reached. In the
we thought would be ascending order of study of Rime (1995), they revealed that after
intimacy-tastes, attitudes, interpersonal major negative life events and traumatic
relationships, and self-concept and sex. This emotional episodes, ordinary emotions, too, are
would help us determine whether people are commonly accompanied by intrusive memories
willing to share superficial information right away and the need to talk about the episode. It also
as well as intimate information. considered the hypothesis that such mental
rumination and social sharing would represent
Thoughts. Is defined as the things in spontaneously initiated ways of processing
mind that one is willing to share with other emotional information.
people. “A friend”, Emerson wrote, “ is a person
with whom I may be sincere. Before him I may Work/Study. Work or study is defined as the
think aloud.” A large number of studies have person’s present duty or responsibility which is
documented the link between friendship and the expected to him and needs to be fulfilled in a
disclosure of personal thoughts and feelings that given time. It is considered a factor in self-
Emerson’s statement implies (Rubin & Shenker, disclosure because this will give a glimpse of
1975). Another study presents a self- how open a person can share his joy and burden

in his current responsibility. In the study of Starr Method


(1975), it was hypothesized that self- disclosure
is causally related to psychological and physical Search for Content Domain
well being, with low disclosure related to In the search for content domains, a
maladjustment and high disclosure associated survey was made and answered by 55 females
with mental health. from 16-22 years old. The respondents were
students from the CLA, COE, COS and CBE of
Table 1 DLSU. The survey questionnaire aims to gather
Hypothesized factors of Self-disclosure data about the self-disclosing activities of the
students. The survey questionnaire indicates the
Factor Definition person whom one usually discloses, topics
Emotional state One’s revelation of emotions or
feelings to other people. Feelings,
disclosed, situation where one discloses, how
attitudes toward a situation being one discloses, characteristics while disclosing,
revealed to others. and rate of their own self- disclosing habit. The
Interpersonal Indicates movement towards greater
relationship intimacy in interpersonal
self- disclosure questionnaire by Sidney Jourard
relationships. Range of relationships and Rubin and Shenkers intimacy of self-
or bonding formed within the outside disclosure was reviewed on how they came up
the family.
with their items and factors.
Personal Private truth about oneself,
matters favorable or unfavorable, toward
something or someone and is Item Writing and Review
exhibited in one’s belief, feelings or Based from the survey, 114 items under
intended behavior. Being honest and
seeking others to know you better by nine factors were constructed and the verbal
disclosing. frequency scale was used. The items were
Problems Depressing event or situation that reviewed by two psychology professors and one
can be lightened through disclosing.
Conflict, disagreement experienced psychometrician from De La Salle University.
by an individual. Some items were deleted, some were removed,
Religion Ability of an individual to share his and some were added. After being reviewed the
experiences, thoughts and emotions
toward his feeling of God. Concept, pre-try out form was constructed.
perception and view of religion by an
individual being able to share or Development of the Pretest Form
tackle in the face of others.
Sex As a way of being in the world of The pretest-form consists of 114 items
men and women whose moments of with nine factors. The factors were sex (5 items),
life is spent to experience being with problems (21 items), interpersonal relationship
the entire world in a distinctly male
or female way. Willingness of a (17 items), accomplishments/work/study (14
person to discuss his sexual items), religion (6 items), tastes (8 items),
experiences, needs and views. thoughts (9 items), personal matter (20 items).
Taste Likes and dislikes of a person
opened to other people. Views,
The scaling used is the verbal frequency scale
feeling, appreciation of a person, Always, often, sometimes, rarely, never).
place or thing.
Thoughts Information in mind that you are
willing to share with other people.
Pretryout Form
Perception regarding a thing, or In the pretryout form, 10 forms were
situation which is shared with others. prepared to be answered by 10 respondents
Work/study/ Person’s present duty in which is
accomplishment expected to him. A person’s
conveniently selected, then a feedback is given
responsibility being expected by on vague and not applicable items, and other
others and to be fulfilled in a comments. There were 10 psychology majors
particular time.
who answered the pre-test form (6 females and 4
males).

The pretryout form consists of 110 items factors and the reliability was obtained using the
still with nine factors. There were six negative Cronbach’s alpha. The items were grouped using
items (item no. 7, 30, 97, 106, 107, 109) and the Principal Components Analysis.
rest were positive items. The scaling used was
the verbal frequency scale because the test is a Development of the Final form
measure of a habit. The order of the items were In the final form there were 60 items
randomly arranged and the responses are accepted and 62 items were deleted in the item
answered by checking the corresponding scale. analysis due to low factors loadings (below .40).
The purpose of the pretryout form is for mild There were five factors extracted in the Principal
testing 10 subjects and to ask for comments for Components Analysis: Beliefs, relationships,
further revision. personal matters, interests, and intimate feelings.

Development of the Main-tryout form Plan on developing the Norms


The comments made on the pretryout A norm will be used to interpret the
form were considered and the main-tryout form scores. The test is scored based on the
was developed. The main-tryout form was corresponding answer on each item. A score is
consists of 112 items. The test was intended for yield for a particular factor. The raw score will
adolescents because the items were empirically have an equivalent percentile based on a norm.
based on adolescent subjects and it reflects their And a corresponding percentile will have a
usual activities. There were six negative items. remark.
The scaling used was the verbal frequency scale.
The arrangement of items were in random order Test Plan
and the task of the respondent is to check the In administering the test there is no
corresponding scale beside each item. alloted time to answer the test. The respondents
There were 100 respondents who or person taking the test is instructed to shade
answered the test. The respondents were fourth their corresponding answer on the answer sheet.
year highschool students of St. Augustine School There is no right or wrong answer in the test so
in Cavite, their ages ranging from 15 to 16, there respondents should answer as honestly as
were 48 males and females. The rest of the possible.
participants were college students from De La In scoring, the answer Always is
Salle University. equivalent to 5 points, often=4, sometimes=3,
The sampling design is purposive in rarely=2, never=1. All the items are positive
which the respondent’s selection criteria is because all the negative items were removed
should belong to fourth year level in highschool during the item analysis due to low factor
and in college in private schools. During the loadings. The score on each item will be
administration of the test, the researchers summated and there is an equivalent percentile
explained the purpose of the test to the students for a particular score.
and they all agreed to answer. It took the In the interpretation, the garnered
respondents 20 minutes to answer the test. The percentile will have a remark of high frequency,
researchers then reviewed the data after the average frequency, and low frequency.
collection, each test was scored and encoded in A low disclosing individual would mean
the computer. that the particular person never or rarely opens
up his or herself toward others in the particular
Item Analysis and Factor Analysis area.
The 112 items were intercorrelated and An average self-disclosing individual
the factors were extracted using the SPSS computer
software. A matrix was made between the opened in general terms about a particular matter

only when necessary and on selected others on Table 2


a particular area. Accepted items with their factor loadings
A high self-disclosing individual would
mean that the person has opened and shared Item number Factor 1
item 33 .68766
himself fully and in complete details to others in item 70 .64815
the particular area. The individual will have the item 8 .61846
item 3 .59245
tendency to let himself to be known in all item 20 .55228
dimensions of his or her being. item 98 .53677
item 77 .45061
item 52 .45001
Results item 59 .40504
item 101 .38157
The corrected item-total correlation of the item 18 .32574
62 items have a total correlation of above .30. Factor 2
item 88 .64293
The item-total correlation of accepted items item93 .72024
ranges from .4866 to .3009, the item correlation item95 .6780
item65 .59372
of the deleted items ranges from -.0123 to .2980. item94 .54047
The coefficient alpha reliability is .9134, the item53 .51697
item75 .50285
standard item alpha is .9166. item68 .44658
A correlation matrix was made on the item66 .41453
item76 .41102
112 items, the mean for the interitem correlation item99 .36690
is .339, the variance is 1821.3782, and the item99 .36690
item11 .36482
standard deviation is 42.6776. The highest Factor 3
intercorrelation of items is .6543 that occurred item111 .29875
item82 .77164
between item number 51 and item number 74. item83 .69717
In the process of factor analysis, the item17 .59079
item10 .54027
hypothesized nine factors were extracted into 18 item104 .45128
factors with an eigenvalue of 1.07878. The item100 .44697
item56 .42587
researchers considered 4% of variance which item60 .39554
offers 5 factors. Table 2 shows the accepted item62 .39486
item69 .32917
items with their factor loadings. item27 .63290
item34 .61822
item39 .58744
Factor 4
item78 .54582
item43 .49976
item01 .49312
item28 .43613
item26 .43205
item35 .42141
item32 .41807
item72 .41475
item06 .35834
item73 .32098
Factor 5
item10 .54207
item100 .446979
item104 .54207
Item17 .59079
item56 .42587
item60 .39554
item62 .38486
item69 .32917
item82 .77164
item83 .69717

Table 3 Discussion
Factor Transformation Matrix At first there were nine hypothesized
factor based on a survey, 18 factors were then
FACTOR FACTOR FACTOR FACTOR FACTOR
1 2 3 4 5
extracted with eigenvalues greater than 1.00.
FACTOR .48 -.56 -.25 -.001 -.61 Finally there were a final of five factors with
1
FACTOR .45 .43 -.56 -.49 .19 acceptable factor loadings. The five factors have
2
FACTOR .45 .55 .57 .09 -.39
new labels because the items were rotated
3
FACTOR
differently based on the data on the main tryout.
.4 -.42 .49 -.34 .52
4 Factor 1 contains items about the beliefs on
FACTOR .41 .02 -.21 .79 .39
5 religion, and ideas on a particular topic and it is
labeled as such. Factor 2 contains items
reflecting relationships with friends and it was
The new five factors were given new names labeled as “relationships.” Factor 3 contains
because the contents were different. Factor 1 items about a person’s secrets and attitudes and
was labeled as Beliefs with 11 items, Factor 2 most of the items contains personal matters and
was labeled as relationships with 13 items, it was labeled as such. Factor 4 is a cluster of
Factor 3 labeled as Personal Matters with 13 taste and perceptions so it was labeled as
items, and Factor 4 as intimate feelings with 13 interest. Factor 5 contains feelings about
items, and factor 5 labeled as interests with 10 oneself, problems, love, success, and
items. frustrations, so it was labeled as intimate
feelings. The factors were reliable due to their
Table 4 alpha which are .8031, .7696, .7962, .7922,
New Table of Specification .7979. It only shows that each factor is
consistent with the intended purpose of the
FACTORS Number ITEM RELIABILITY researchers. In the result of factor analysis the
of items NUMBER
items were not equal in each factor, factor 1 has
Factor 1: Beliefs 11 8,101,18, .8031 11 items, factor 2 has 13 items, factor 3 has 13
20,33, 52,
59, 70, 77, items, factor 4 has 10 items and factor 5 has 13
98, 3 items. The five factors account for the areas in
Factor 2: 13 105, 15, .7696 which a particular individual self-discloses.
Relationships 21, 24, 31,
41, 48, 55,
61, 63 79, There were nine hypothesized factors, all
84, 88 of these were disproved, new factors arrived after
Factor 3: Personal 13 11, 111, .7962
factor analysis. The items were reclassified in
Matters 53, 65, 66, every factor and it was given a new name. Only
68, 75, 76,
93, 94, 95,
five factors were accepted following the four
96, 99 percent rating of the eigenvalue. These factors
Factor 4: Intimate 13 1,6, 26, 27, .7922
are Beliefs, Relationships, Interests, Personal
Feelings 28, 32, 34, matters, and intimate feelings. The test we have
35, 39, 43,
72, 73, 78
developed intended to measure the degree of
self-disclosure of individuals but it was refocused
Factor 5: Interests 10 10, 100, .7979 to measure the self-disclosure each person
104, 17,
56, 60, 62, makes on each different areas or factors.
69, 82, 83

60 In terms of the test’s psychometric


property, it has gone in the level of item review
by experts and factor analysis, it has an internal

consistency of .9134 which is high. Considering self - disclosure in psychotherapy. In Stricker, G.


that the test has just undergone its initial stages, & Fisher, M. (eds.) Self-disclosure in the
further validation study is recommended to give therapeutic relationship (pp. 17-27). New York,
more detailed properties of the test. Norming and NY, US: Plenum Press.
interpretation for the test is not yet further
established where it needs to be administered to Hill, C. T. & Stull, D. E. (1981). Sex differences in
a large sample size. An intensive study should be effects of social and value similarity in same-sex
made with considerable and appropriate number friendship. Journal of Personality and Social
of respondents. In terms of the sampling a Psychology, 41(3), 488-502.
probabilistic technique is suggested to account
for further generalization in the study because Jones, E. E., & Archer, R. L. (1976). Are there
the current test only used a purposive non- special effects of personalistic self - disclosure?
probabilistic sampling. Journal of Experimental Social Psychology,
12(2), 180-193.
References
Jorgensen, S. R. (1980). Contraceptive attitude -
Altman, I., & Taylor, D. A. (1973). Social behavior consistency in adolescence. Population
penetration: The development of interpersonal & Environment: Behavioral & Social Issues, 3(2),
relationships. New York: Holt, Rinehart & 174-194.
Winston.
Jourard, S. M (1970). Experimenter - subject
Baxter, D. E. (1993). Empathy: Its role in nursing "distance" and self - disclosure. Journal of
burnout. Dissertation Abstracts International, 53, Personality and Social Psychology, 15(3), 278-
4026. 282.

Chelune, G. J., Skiffington, S, & Williams, C. Jourard, S. M. & Jaffe, P. E. (1970). Influence of
(1981). Multidimensional analysis of observers' an interviewer's disclosure on the self - disclosing
perceptions of self - disclosing behavior. Journal behavior of interviewees. Journal of Counseling
of Personality and Social Psychology, 41(3), 599- Psychology, 17(3), 252-257.
606.
Jourard, S. M. & Landsman, M. J. (1960).
Coleman, C., Butcher, A. & Carson, C. (1980). Cognition, cathexis, and the dyadic effect in
Abnormal psychology and modern life (6th ed.). men's self-disclosing behavior. Merrill-Palmer
New York: JMC. Quarterly, 6, 178-185.

Cozby, P. C. (1973). Self - disclosure: A literature Jourard, S. M. & Rubin, J. E. (1968). self -
review. Psychological Bulletin, 79(2), 73-91. disclosure and touching: a study of two modes of
interpersonal encounter and their inter - relation.
Goldstein, J. H. (1994). Toys, play, and child Journal of Humanistic Psychology, 8(1), 39-48.
development. New York, NY, US: Cambridge
University Press. Jourard, S. M. (1959). Healthy personality and
self-disclosure. Mental Hygiene, 43, 499-507.
Hartley, P. (1993). Interpersonal communication.
Florence, KY, US: Taylor & Frances/Routledge. Jourard, S. M. (196). Religious denomination and
self - disclosure. Psychological Reports, 8, 446.
Healey, B. J. (1990). Self - disclosure in religious
spiritual direction: Antecedents and parallels to

Jourard, S. M. (1961). Self-disclosure patterns in 13(3), 237-249.


British and American college females. Journal of
Social Psychology, 54, 315-320. Maningas, I. (1995). Moral theology. Manila:
DLSU Press.
Jourard, S. M. (1961). Self-disclosure scores and
grades in nursing college. Journal of Applied Newcomb, T. M. (1981). The acquaintance
Psychology, 45(4), 244-247. process. Oxford, England: Holt, Rinehart &
Winston.
Jourard, S. M. (1964). The transparent self.
Princeton: Van Nostrand, 1964. Pennebaker, J. W. (1995). Emotion, disclosure,
and health: An overview. Emotion, Disclosure, &
Jourard, S. M. (1968). You are being watched. Health, 14, 3-10.
PsycCRITIQUES, 14(3), 174-176.
Priest, R. F. & Sawyer, J. (1967). Proximity and
Jourard, S. M. (1970), The beginnings of self- peership: bases of balance in interpersonal
disclosure. Voices: the Art & Science of attraction. American Journal of Sociology, 72(6),
Psychotherapy, 6(1), 42-51. 633-649.
Richman, S. (1963). Because experience can't
Jourard, S. M. (1971). Self - disclosure: An be taught. New York State Education, 50(6), 18-
experimental analysis of the transparent self. 20.
Oxford, England: John Wiley.
Rimé, B. (1995). The social sharing of emotion
Jourard, S. M., & Landsman, M. J. (1960). as a source for the social knowledge of emotion.
Cognition, cathexis, and the "dyadic effect" in In Russell, J. A., Fernández-Dols, J., Manstead,
men's self-disclosing behavior. Merrill-Palmer A., & Wellenkamp, J. C. (eds). Everyday
Quarterly, 6, 178-186. conceptions of emotion: An introduction to the
psychology, anthropology and linguistics of
Jourard, S. M., & Resnick, J. L. (1970). Some emotion (pp. 475-489). NATO ASI series D:
effects of self - disclosure among college women. Behavioural and social sciences, Vol. 81. New
Journal of Humanistic Psychology, 10(1), 84-93. York, NY, US: Kluwer Academic/Plenum
Publishers.
Jourard, S. M., & Richman, P. (1963). Disclosure
output and input in college students. Merrill- Rubin, J. A. & Levy, P. (1975). Art-awareness: A
Palmer Quarterly, 9, 141-148. method for working with groups. Group
Psychotherapy & Psychodrama, 28, 8-117.
Kelly, A. E. & McKillop, K. J. (1996). Rubin, Z. (1970). Measurement of romantic love.
Consequences of revealing personal secrets. Journal of Personality and Social Psychology, 16,
Psychological Bulletin, 120(3), 450-465. 265-273.
Kroger, R. O. (1994). The Catholic Confession Starr, P. D. (1975). Self - disclosure and stress
and everyday self - disclosure. In Siegfried, J. among Middle - Eastern university students.
(ed). The status of common sense in psychology Journal of Social Psychology, 97(1), 141-142.
(pp. 98-120). Westport, CT, US: Ablex
Publishing. Taylor, D. A., & Gould, R. J., & Brounstein, P. J.
(1981). Effects of personalistic self - disclosure.
Levinger, G. & Senn, D. J. (1967). Disclosure of Personality and Social Psychology Bulletin, 7(3),
feelings in marriage. Merrill-Palmer Quarterly, 487-492.

Exercise

Give the best type of reliability or validity to use in the following cases.

___________________1. A scale measuring motivation was correlated on a scale measuring


laziness, a negative coefficient was expected.

___________________2. An achievement test on personality theories was administered to


psychology majors, and the same test was administered among engineering students who have
not taken the course. It is expected that there would be a significant difference in the mean
scores of the two groups.

___________________3. The 16 PF that measures 16 personality factors were intercorrelated


with the 12 factors of the Edwards Personal Preference Schedule (EPPS). Both instruments are
measures of personality but contain different factors.

___________________4. The multifactorial metamemory questionnaire (MMQ) arrived with


three factors when factor analysis was conducted. It had a total of 57 items that originally belong
to 5 factors.

___________________5. The scores on the depression diagnostic scale were correlated with the
Minnesota Multiphasic Personality Inventory (MMPI). It was found that clients who are
diagnosed to be depressive have high scores on the factors of MMPI.

___________________6. The scores of Mike’s mental ability taken during fourth year high
school was used in order to determine whether he would be qualified to enter the college where he
wants to study.

___________________7. Maria who went for drug rehabilitation was assessed using the self-
concept test and her records in the company where she was working at were requested that
contains her previous security scale scores. The two tests were compared.

___________________8. Mrs. Ocampo, a math teacher, constructs a table of specifications
before preparing her test, and after making the items, it is checked by her subject area coordinator.

___________________9. In an experiment, the self-disclosure of participants was obtained by


having three raters listen to the recordings between a counselor and client having a counseling
session. The raters used an ad hoc self-disclosure inventory and later their ratings were compared
using the coefficient of concordance. The concordance indicates whether the three raters agree
on their ratings.

___________________10. A test measuring "sensitivity" was constructed. In order to establish


its reliability, the scores for each item were entered in a spreadsheet to determine whether the
responses for each item were consistent.

___________________11. The items of a newly constructed personality test measuring Carl


Jung's psychological functions used a Likert scale. The scores for each item were correlated
with all possible combinations.

___________________12. A test on science was made by Ms. Asuncion a science teacher.


After scoring each test she determined the internal consistency of items.

___________________13. In a battery of tests, the section A class received both the Strong
Vocational Interest Blank (SVIB) and the Jackson Vocational Interest Survey (JVIS). Both are
measures of vocational interest and the scores are correlated to determine if one measures the
same construct.

___________________14. The Work Values Inventory (WVI) was separated into 2 forms and
two sets of scores were generated. The two sets of scores were correlated to see if they measure the
same construct.

___________________15. Children's moral judgment was studied to determine if it would change over time.


It was administered during the first week of classes then another at the end of the first quarter.

___________________16. The study of values was designed to measure 6 basic interests,


motives, or evaluative attitudes such as theoretical, economic, aesthetic, social, political, and
religious. These six factors were derived after a validity analysis.

___________________17. When the EPPS items were presented in a free choice format, the
scores correlated quite highly with the scores obtained with the regular forced-choice form of
the test.

___________________18. The two forms of the MMPI (Form F and form K scales) were
correlated to detect faking or response sets.

___________________19. In a study by Miranda, Cantina and Cagandahan (2004) they


intercorrelated the 15 factors of the Edwards Personal Preference Inventory.

Lesson 3
Item Difficulty and Item Discrimination

Students are usually keen on judging, based on their own impressions, whether an item is difficult
or easy and whether a test is good or bad. The degree to which a test item is easy or difficult is
referred to as item difficulty, and the degree to which an item separates high-performing from
low-performing test takers is referred to as item discrimination. Identifying a test item's difficulty
and discrimination is referred to as item analysis. Two approaches to item analysis will be presented
in this lesson: Classical Test Theory (CTT) and Item Response Theory (IRT). A detailed discussion
of the differences between CTT and IRT is found at the end of the lesson.

Classical Test Theory

Classical Test Theory (CTT) is also regarded as "True Score Theory." It assumes that the responses of
examinees are due only to variation in the ability of interest. All other potential sources of variation in the
testing situation, such as external conditions or the internal conditions of examinees, are assumed either to be
held constant through rigorous standardization or to have an effect that is nonsystematic or random in nature.
The focus of CTT is the frequency of correct responses (to indicate item difficulty), the frequency of responses
to each option (to examine distracters), and the reliability of the test and the item-total correlation (to evaluate
discrimination at the item level).
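
To make the item-total correlation mentioned above concrete, here is a minimal sketch that computes a corrected item-total correlation for one item from a small 0/1 response matrix; the responses are invented solely for illustration.

```python
import numpy as np

# Rows = examinees, columns = items; 1 = correct, 0 = incorrect (made-up data).
responses = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
])

item = 2  # index of the item being evaluated
# "Corrected" total score: exclude the item itself so it does not inflate the correlation.
rest_score = responses.sum(axis=1) - responses[:, item]
item_total_r = np.corrcoef(responses[:, item], rest_score)[0, 1]
print(f"Corrected item-total correlation for item {item + 1}: {item_total_r:.2f}")
```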

Item Response Theory

Item Response Theory (IRT) is synonymous with latent trait theory, strong true score theory, or modern
mental test theory. It is more applicable to tests with right and wrong (dichotomous) responses. It is an
approach to testing based on item analysis that considers the chance of getting particular items right or
wrong. In IRT, each item on a test has its own item characteristic curve that describes the probability of
getting that particular item right or wrong given the ability of the test taker (Kaplan & Saccuzzo, 1997).
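
As an illustration of an item characteristic curve, the sketch below plots the probability of a correct response as a function of ability using the three-parameter logistic model, one commonly used IRT formulation (the text does not prescribe a particular model); the discrimination (a), difficulty (b), and guessing (c) values are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

def icc(theta, a=1.2, b=0.0, c=0.2):
    """Three-parameter logistic item characteristic curve:
    probability of a correct response given ability theta."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 200)  # range of examinee ability
plt.plot(theta, icc(theta))
plt.xlabel("Ability (theta)")
plt.ylabel("Probability of a correct response")
plt.title("Illustrative item characteristic curve (3PL)")
plt.show()
```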

Item difficulty is the percentage of examinees responding correctly to each item in the
test. Generally, an item is difficult if a large percentage of the test takers are not able to
answer it correctly. On the other hand, an item is easy if a large percentage of the test takers are
able to answer it correctly (Payne, 1992).

Item discrimination refers to the relation of performance on each item to performance on


the total score (Payne, 1992). An item discriminates well if most of the high-scoring test takers are
able to answer the item correctly; an item has low discriminating power if the low-scoring test
takers answer the item correctly as often as the high-scoring test takers do.

Procedure for Determining Index of Item Difficulty and Discrimination

1. Arrange the test papers in order from highest to lowest.



2. Identify the high- and low-scoring groups by getting the upper 27% and lower 27%. For
example, if there are 20 test takers, 27% of the 20 test takers is 5.4; rounding it off gives 5 test
takers. This means that the top 5 (high-scoring) and the bottom 5 (low-scoring) test takers will
be included in the item analysis.

3. Tabulate the correct and incorrect responses of the high and low test-takers for each item. For
example, in the table below there are 5 test takers in the high group (test takers 1 to 5) and 5 test
takers in the low group (test takers 6 to 10). Test takers 1 and 2 in the high group got a correct
response for items 1 to 5. Test taker 3 was wrong in item 5, marked as "0."

Item 1 Item 2 Item 3 Item 4 Item 5 Total


High Test taker 1 1 1 1 1 1 5
test Test taker 2 1 1 1 1 1 5
takers Test taker 3 1 1 1 1 0 4
Group Test taker 4 1 0 1 1 0 4
Test taker 5 1 1 1 0 0 3
Total 5 4 5 4 2
Low Test taker 6 1 1 0 0 0 2
test Test taker 7 0 1 1 0 0 2
takers Test taker 8 1 1 0 0 0 2
group Test taker 9 1 0 0 0 0 1
Test taker 10 0 0 0 1 0 1
Total 3 3 1 1 0

4. Get the total correct responses for each item and convert the total into a proportion. The proportion is
obtained by dividing the total correct responses for each item by the total number of test takers in
the group. For example, in item 2, the total correct responses is 4, and dividing it by 5, the total
number of test takers in the high group, gives a proportion of .8. The procedure is done for both the
high and low groups.

pH = (Total Correct Responses in the High Group) / (N per group)          pL = (Total Correct Responses in the Low Group) / (N per group)

Item 1 Item 2 Item 3 Item 4 Item 5 Total


High Test taker 1 1 1 1 1 1 5
test Test taker 2 1 1 1 1 1 5
takers Test taker 3 1 1 1 1 0 4
Group Test taker 4 1 0 1 1 0 4
Test taker 5 1 1 1 0 0 3
Total 5 4 5 4 2
Proportion of the 1 .8 1 .8 .4
High Group (pH)
Low Test taker 6 1 1 0 0 0 2
test Test taker 7 0 1 1 0 0 2
takers Test taker 8 1 1 0 0 0 2
group Test taker 9 1 0 0 0 0 1
Test taker 10 0 0 0 1 0 1
Total 3 3 1 1 0
Proportion of the .6 .6 .2 .2 0
low group (pL)
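
The tabulation in steps 3 and 4 can also be done with a short script. The sketch below encodes the same responses of the five high-scoring and five low-scoring test takers and computes the proportion correct per item for each group (pH and pL).

```python
import numpy as np

# 1 = correct, 0 = incorrect; rows follow the worked table above.
high_group = np.array([
    [1, 1, 1, 1, 1],  # test taker 1
    [1, 1, 1, 1, 1],  # test taker 2
    [1, 1, 1, 1, 0],  # test taker 3
    [1, 0, 1, 1, 0],  # test taker 4
    [1, 1, 1, 0, 0],  # test taker 5
])
low_group = np.array([
    [1, 1, 0, 0, 0],  # test taker 6
    [0, 1, 1, 0, 0],  # test taker 7
    [1, 1, 0, 0, 0],  # test taker 8
    [1, 0, 0, 0, 0],  # test taker 9
    [0, 0, 0, 1, 0],  # test taker 10
])

pH = high_group.mean(axis=0)  # proportion correct per item, high group
pL = low_group.mean(axis=0)   # proportion correct per item, low group
print("pH:", pH)              # [1.  0.8 1.  0.8 0.4]
print("pL:", pL)              # [0.6 0.6 0.2 0.2 0. ]
```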

5. Obtain the item difficulty by adding the proportion of the high group (pH) and proportion of
the low group (pL) and dividing by 2 for each item.

Item difficulty = (pH + pL) / 2

Item 1 Item 2 Item 3 Item 4 Item 5


Proportion of the 1 .8 1 .8 .4
High Group (pH)
Proportion of the low .6 .6 .2 .2 0
group (pL)
Item difficulty .8 .7 .6 .5 .2
Interpretation Easy item Average item Average item Average item Difficult item

The table below is used to interpret the index of difficulty. Given the table below, item 1 is an easy
item because it has a high proportion of correct responses for both the high and low groups. Items 2,
3, and 4 are average items because their difficulty indices fall within the .25 to .75 middle bound.
Item 5 is a difficult item considering that the proportions correct are low for both groups; only 40%
of the high group answered it correctly and none in the low group got it correct (0). Generally, as the
index of difficulty approaches "0," the item becomes more difficult; as it approaches "1," the item
becomes easier.

Difficulty Index Remark


.76 or higher Easy Item
.25 to .75 Average Item
.24 or lower Difficult Item
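
Continuing from the pH and pL values above, a short sketch that computes the difficulty index for each item and attaches the remark from the interpretation table:

```python
def difficulty_index(p_high, p_low):
    """Item difficulty as the average proportion correct of the two groups."""
    return (p_high + p_low) / 2

def difficulty_remark(index):
    if index >= 0.76:
        return "Easy item"
    if index >= 0.25:
        return "Average item"
    return "Difficult item"

pH = [1.0, 0.8, 1.0, 0.8, 0.4]
pL = [0.6, 0.6, 0.2, 0.2, 0.0]
for i, (h, l) in enumerate(zip(pH, pL), start=1):
    d = difficulty_index(h, l)
    print(f"Item {i}: difficulty = {d:.2f} ({difficulty_remark(d)})")
```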

6. Obtain the item discrimination by getting the difference between the proportion of the high
group and proportion of the low group for each item.

Item discrimination = pH – pL

Item 1 Item 2 Item 3 Item 4 Item 5


Proportion of the 1 .8 1 .8 .4
High Group (pH)
Proportion of the low .6 .6 .2 .2 0
group (pL)
Item discrimination .4 .2 .8 .6 .4
Interpretation Very good item Reasonably Very good item Very good item Very good item
good item

The table below is used to interpret the index of discrimination. Generally, the larger the difference
between the proportions of the high and low groups, the better the item, because it shows a large gap
in correct responses between the high and low groups, as shown by items 1, 3, 4, and 5. In the case
of item 2, a large proportion of the low group (60%) got the item correct, as contrasted with the high
group (80%), resulting in a small difference (.20) and making the item only reasonably good.

Index discrimination Remark


.40 and above Very good item
.30 - .39 Good item
.20 - .29 Reasonably Good item
.10 - .19 Marginal item
Below .10 Poor item
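
A matching sketch for the discrimination index, using the cut-offs in the table above:

```python
def discrimination_index(p_high, p_low):
    """Item discrimination as the gap between the high- and low-group proportions."""
    return p_high - p_low

def discrimination_remark(index):
    if index >= 0.40:
        return "Very good item"
    if index >= 0.30:
        return "Good item"
    if index >= 0.20:
        return "Reasonably good item"
    if index >= 0.10:
        return "Marginal item"
    return "Poor item"

pH = [1.0, 0.8, 1.0, 0.8, 0.4]
pL = [0.6, 0.6, 0.2, 0.2, 0.0]
for i, (h, l) in enumerate(zip(pH, pL), start=1):
    d = discrimination_index(h, l)
    print(f"Item {i}: discrimination = {d:.2f} ({discrimination_remark(d)})")
```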

Analyzing Item Distracters

Analyzing item distracters involves determining whether the options in a multiple-response
item type are effective. In multiple-response types such as multiple choice, the test taker chooses
the correct answer from among the options; the incorrect options are the distracters. In creating
distracters, the test developer ensures that they belong to the same category as the correct answer
so that they are close to it. For example:

What cognitive skill is demonstrated in the objective “Students will compose a five paragraph
essay about their reflection on modern day heroes”?

a. Understanding
b. Evaluating
c. Applying
d. Creating

Correct answer: d

The distracters for the given item are all cognitive skills in Bloom's revised taxonomy, where each
could be a possible answer but there is one best answer. In analyzing whether the distracters are
effective, the frequency of examinees selecting each option is reported.

Group   Group size    a    b    c    d*   Total no. correct   Difficulty Index   Discrimination Index
High        15        1    3    1   10           17                 .57                  .20
Low         15        1    6    1    7
* d is the correct answer

For the given item with the correct answer of letter d, the majority of examinees in both the
high and low groups chose option "d," which is the correct answer. Among the high group,
distracters a, b, and c are not effective because very few examinees selected them. For the low
group, option "b" appears to be an effective distracter because 40% of the examinees (6) selected
it, as opposed to the 47% (7) who got the correct answer. In this case, distracters "a" and "c" need
some revision to bring them closer to the answer and make them more attractive to test takers.
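
A minimal sketch of how the option frequencies in such a distracter analysis might be tallied, assuming each examinee's chosen option is recorded as a letter; the counts below mirror the worked table above, and the flagging rule is only one possible heuristic.

```python
from collections import Counter

key = "d"  # correct answer for this item

# Options chosen by each examinee in the upper and lower 27% groups
# (counts mirror the table: high group d=10, b=3, a=1, c=1; low group d=7, b=6, a=1, c=1).
high_choices = list("d" * 10 + "b" * 3 + "a" + "c")
low_choices = list("d" * 7 + "b" * 6 + "a" + "c")

high_tally = Counter(high_choices)
low_tally = Counter(low_choices)
print("High group:", dict(high_tally))
print("Low group: ", dict(low_tally))

# A wrong option that attracts almost no low-group examinees is doing little work
# as a distracter and is flagged for revision.
for option in "abc":
    if low_tally[option] <= 1:
        print(f"Option '{option}' may need revision as a distracter.")
```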

EMPIRICAL REPORT

Construction and Development of a Test Instrument

Carlo Magno

Abstract
This study investigated the psychometric properties and item analysis of a one-unit test in geography for grade three students. The skills and contents of the test were based on the contents covered for the first quarter as indicated in the syllabus. A table of specifications was constructed to frame the items into three cognitive skills that include knowledge, comprehension, and application. The test has a total of 40 items in 10 different test types. The items were reviewed by a social studies teacher and an academic coordinator. The split-half reliability was used and a correlation of .3 was obtained. Each test type was correlated, which resulted in low and high coefficients. The item analysis showed that most of the items turned out to be easy and most are good items.

The purpose of this study is to construct and analyze the items of a one-unit geography test for grade three students. The test basically measures grade three students' achievement in Philippine Geography for the first quarter and served as a quarterly test. The test, when standardized through validation and reliability, would be used for future achievement tests in Philippine Geography.
There is a need to construct and standardize a particular achievement test in Philippine Geography since there is none yet available locally.
The test is in the Filipino language because of the nature of the subject. The subject covers topics on (1) Kapuluan ng Pilipinas; (2) Malalaki at Maliliit na Pulo ng Bansa; (3) Mapa at Uri ng Mapa; (4) Mga Direksyon; (5) Anyong Lupa at Anyong Tubig; (6) Simbolong Ginagamit sa Mapa; (7) Panahon at Klima; (8) Mga Salik na may Kinalaman sa Klima; (9) Mga Pangunahing Hanapbuhay sa Bansa; (10) Pag-aangkop sa Kapaligiran. The topics were based upon the lessons provided by the Elementary Learning Competence from the Department of Education.
The test aims for the students to: (1) identify the important concepts and definitions; (2) comprehend and explain the reasons for given situations and phenomena; and (3) use and analyze different kinds of maps in identifying important symbols and familiarity with places.

Method

Search for Skills and Content Domain
The skills and contents of the test were identified based on the topics covered for grade three students in the first quarter. The test is intended to be administered as the first quarter exam. The skills intended for the first quarter's topics include identifying concepts and terms, comprehending explanations, applying principles to situations, using and analyzing maps, synthesizing different explanations for a particular event, and evaluating the truthfulness and validity of reasons and statements through inference.
In constructing the test, a table of specifications was first constructed to plan out the distribution of items for each topic and the objectives to be gained by the students.

Table 1. Table of Specification for a unit in Philippine Geography for Grade 3


Nilalaman Natutukoy ang Nauunawaan ang Nagagamit at Total Number
mahahalagang mga dahilan sa nasusuri ang mapa of Items
konsepto at mahahalagang sa pagtukoy ng
kahulugan kapaliwangan sa mga mahahalagang
bawat sitwasyon pananda
Kapuluang Pilipinas 4 4
Malalaki at maliliit na 4 4
pulo ng bansa
Mapa at Uri ng mapa 4 4

Mga direksyon 6 6
Anyong lupa at 5 5
Anyong Tubig
Simbolong ginagamit 4 4
sa mapa
Panahon at Klima 2 3 5
Mga salik na may 2 2
kinalaman sa klima
Mga pangunahing 3 3
hanapbuhay ng bansa
Pag-aangkop sa 3 3
kapaligiran
Total Number of Items 11 16 13 40

Percentage 27.5% 40% 32.5% 100

Table of Specifications placed on the knowledge part since there is a


The Table of Specification contains 10 little need for the students to recall and memorize
topics taken which is a unit about Philippine concepts and terms. The main highlight of this
Geography. The 27.5% of the items were placed unit is to gain the ability to explain geographical
for the knowledge level, 40% were placed for principles on Philippine geography and its
comprehension, and 32.5% were placed on the relatedness to our culture.
application level. Most of the items were
concentrated on the comprehension since the Item Writing
main purpose is for the students to understand There were 40 items constructed based
and comprehend the unit on Philippine on the Table of Specification (see Table 1). A 40-
Geography and it is the foundation knowledge for item test is just enough for grade three students
the entire lesson for the school year. Having since it is not too much or few for their capacity.
mastered this base knowledge will help students Also in determining the amount of items to place
explain and give reasons for the next lessons on the test, the attention span and time frame for
that will be taken. Also, most of the items were testing is considered. Basically in the quarterly
distributed on the application level since the test, a particular test on a subject is given a time
students need to learn practically how to use limit of one hour.
maps, and how could they benefit from using The items were based more from what
maps and figures of the unit. Few items were the students gained from the discussion in the

classroom, reflection on the topic, work Test Administration


exercises, group works, activities in school, and Respondents. There were 88 grade 3
from the book. students in three sections who took the test for
The items were divided into 10 parts in the purpose of a Quarter Examination. Out of the
the test. Test I contains four items in a True or 88 students, the top 40 students were the ones
False type. Test II contains 5 items in a matching that were included in the sample. There are 11
type of test. Test III contains 2 items in a multiple (27%) respondents each for the upper and lower
choice type and the stem item is bases on a group which scores is subjected for item analysis
figure presented. Test IV contains 4 items within for difficulty and discrimination.
2 situations. Test V contains 4 items in a multiple Procedure. The teacher for grade 3
choice type, a physical map as a basis for Sibika at Kultura directly instructed the two other
answering. Test VI another multiple choice type teachers who will administer the test for the two
and concentrates on the use of different types of other sections. It was kept into consideration the
map. Test VII a short answer type of test in which constancy and the other factors that would affect
the students will supply what direction is asked the students’ performance on the test. The test
from the question base on a map presented was administered simultaneously for the three
containing 6 items. Test VIII a 5-item interpretive classes in the morning as the first test to be
exercise type of test in which a situation is given taken for that day. The students took the test for
and for each situation inferences were listed and one hour, some students were able to finish the
the task of the students is to choose the best test ahead of time, and they were just advised to
inference applicable for the given situation. Test review their work. When the bell rang the teacher
IX a three-item multiple choice type in which the instructed the students to pass their paper
students will answer depending on a figure of a forward. All the test papers were gathered and
Philippine map and weather condition is given. were checked. After a week the students were
Test X a three-point essay question evaluated informed about their results and the top 40
according to the (a) correctness of answer students that were included in the sample for
(1.5pts) ; (b) Explanation (1 pt); and, (c) followed study was informed about the teachers’ concern
instruction (0.5 pt). There were two raters who research and the students’ score, the parents
evaluated the answer for the essay type of test. was sent to inform them about the purpose of the
research and the students’ score, the parents
Content Validation replied positively.
The test was content validated and Data-Analysis. The scores were
reviewed by a teacher in Social Studies from tabulated and encoded in so that the computation
Ateneo de Davao. The suggestions were of the results will be easy. The split-half method
considered and the test was revised accordingly. for obtaining the internal consistency among the
Also, before arriving with final draft of test for scores was employed. The odd and the even
administration, it was checked by the Academic items were separated and were correlated in
coordinator of the School where the test will be using the Pearson’s r moment correlation
administered whether the items are appropriate coefficient. The upper and lower groups were
for the level of grade three students and some chosen according to 27% of the lowest and the
typographical errors. In the process of content highest among the 40 respondents. The item
validation, the topics covered and the table of analysis was employed by computing for each
specification was provided in order to determine item’s difficulty and the item discrimination. The
whether the items were generally covered for the remark for each item was then given according to
topics studied. the standards of difficulty and its discrimination,
whether a good item or not. The Coefficient of
Concordance was used in order to inter-rater

reliability of the essay type of test. There were two judges who evaluated and used criteria to score the essay part of the test.

Results and Discussion

Reliability
The test's reliability was estimated through the split-half method by correlating the odd-numbered and even-numbered items. The resulting internal consistency is 0.3, which indicates a low but definite correlation among the items. The low correlation between the odd- and even-numbered items can be attributed to the different topic contents within the 40-item test. It would have been more appropriate to construct a large pool of items for the 10 content topics or factors that the test has, but 40 items is the school's usual standard for the quarterly test. The test was administered for the purpose of the quarterly test, so the usability of the test was a consideration. With this type of measure, only the reliability of half of the test is accounted for, which explains the low value of the correlation coefficient. The split-half coefficient was then transformed into a Spearman-Brown coefficient since the correlation covers only half of the test. The resulting Spearman-Brown coefficient is 0.46, which means that the items have a moderate relationship.
Also, it is a rule of thumb that there should be at least 30 pairs of scores to be correlated, but in this case there were only 18 scores correlated. The last item was not included since it has no partner item to be correlated with, because the remaining items were essay type and were subjected to a different analysis. The low coefficient of internal consistency can also be attributed to the various types of tests used, and thus to the variation and differences in the performance of the respondents. In other words, the respondents may respond and perform differently for each type of test.
The nature of the test cannot be assessed in terms of general homogeneity since the test contains several topics and several response formats. Thus, respondents perform differently for different types of test. The test has 10 types measuring different skills such as identifying important concepts and definitions, comprehending and explaining the reasons for given situations and phenomena, and using and analyzing different kinds of maps in identifying important symbols and familiarity with places.
The dilemma, however, is that the content domains included in the test are part of a general topic on Philippine geography. To test the internal consistency among the 9 different contents, a correlation matrix was computed.
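The split-half procedure described above can be illustrated with a short script. The sketch below assumes a hypothetical matrix of dichotomously scored responses (rows are examinees, columns are items); the variable names are illustrative and not taken from the study.

# Minimal sketch of the split-half method with the Spearman-Brown correction,
# assuming a hypothetical 0/1 score matrix (rows = examinees, columns = items).
import numpy as np

def split_half_reliability(scores: np.ndarray) -> tuple[float, float]:
    """Correlate odd- and even-numbered item subtotals, then apply the
    Spearman-Brown correction to estimate full-test reliability."""
    odd_total = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
    even_total = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...
    r_half = np.corrcoef(odd_total, even_total)[0, 1]
    r_full = (2 * r_half) / (1 + r_half)      # Spearman-Brown prophecy formula
    return r_half, r_full

# Example with random dichotomous responses (40 examinees, 40 items)
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(40, 40))
half, full = split_half_reliability(responses)
print(f"split-half r = {half:.2f}, Spearman-Brown corrected = {full:.2f}")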
Table 2. Intercorrelation among the Nine Contents of the Test

        I       II      III     IV      V       VI      VII     VIII    IX
I       --
II      -0.13   --
III     0.98*   1       --
IV      0.18    -0.81*  -0.48*  --
V       -0.21   -0.42   0.47*   -0.19   --
VI      0.19    0.58*   0.47*   0.6     -0.65*  --
VII     -0.73   0.28    0.41*   -0.56*  0.73*   -0.24   --
VIII    0.07    -0.19   -0.47*  0.96*   0.08    -0.8    -0.25   --
IX      0.85*   -0.58*  0.48*   0.15    0.97*   -0.52*  -0.52*  0.28    --
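The intercorrelations reported in Table 2 come from pairing the subtest scores and computing Pearson coefficients. A minimal sketch of this kind of computation is shown below; the subtest score matrix here is hypothetical rather than the actual data.

# Illustrative sketch of producing an intercorrelation matrix like Table 2,
# using a hypothetical matrix of nine subtest scores (rows = examinees).
import numpy as np

rng = np.random.default_rng(1)
subtest_scores = rng.integers(0, 5, size=(40, 9)).astype(float)

# np.corrcoef expects variables in rows, so transpose the matrix
r_matrix = np.corrcoef(subtest_scores.T)
labels = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"]
for i, row_label in enumerate(labels):
    row = " ".join(f"{r_matrix[i, j]:6.2f}" for j in range(i + 1))
    print(f"{row_label:>4} {row}")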
There is a high relationship between test I and test IX: the higher the scores on identification of concepts, the higher the scores on comprehension of the weather map. Also, a high relationship existed between test V and test IX: the higher the scores on the interpretation of a physical map, the higher the scores on interpretation of the weather map. There is also a high relationship between test IV and test VIII: the higher the scores on inference about the Philippine islands, the higher the scores on comprehension of weather. Generally, the intercorrelations among the contents gave fairly crude results because of the small number of items involved and because the items for each type of test were not equal in number. The pairing in the computation was based on the minimum number of items for each test type.

Item Difficulty and Discrimination Index
To evaluate the quality of each type of item in the test, item analysis was done by determining each item's difficulty and discrimination index. The proportion of examinees getting each item correct was evaluated according to the scale below.

Difficulty Index    Remark
.76 or higher       Easy Item
.25 to .75          Average Item
.24 or lower        Difficult Item
Source: Lamberte, B. (1998). Determining the Scientific Usefulness of Classroom Achievement Test. Cutting Edge Seminar. De La Salle University.

Table 3 indicates each item's difficulty value and discrimination index value. The difficulty index shows a pattern in which 67.6% of the items are easy and 32.43% of the test is on the average scale. Considering that the test was constructed for grade three students, the teacher pitched it at the level of the students' capacity and ability. But it may also mean that the students gained mastery of the subject matter, so that most of them were able to answer the items correctly. It should be noted that the easiness or difficulty of the test items is dictated by the proportion of the students who answered the item correctly. In this case, most of the respondents got the answers, which is why most of the items turned out to be easy. It can be concluded that, in general, the test was fairly easy since most of the items turned out at 76% and above.
Also, Table 3 indicates the discrimination index of each item. There were 27% of the items that are considered poor. These items were rejected since most scores are in the high range of the low group and some scores of the low group are near the scores of the high group who answered the items correctly. Considering the poor items such as items 2, 4, 9, 13, 15, 30, 31, 32, 33, and 34, the pattern is indicative. There are very few marginal items that are subject to improvement: only 8% (3 items) are remarked as marginal, since the scores of the low group and the high group are almost the same. This means that both the high and the low group can answer these items about equally. 21.6% (8 items) of the items are reasonably good items since there is enough interval between the high and low groups. Also, there are a few items remarked as good items and enough to be considered very good items: 16.21% of the items are good items and 24.3% are very good items. For these there is a wide distance between the scores of the high group and the low group.
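The difficulty and discrimination indices reported in Table 3 follow the upper and lower 27% grouping described earlier: the difficulty index is the average proportion correct of the two extreme groups, and the discrimination index is their difference. The sketch below illustrates these computations on a hypothetical score matrix, with remarks based on the scale cited from Lamberte (1998).

# A sketch of classical item analysis using upper/lower 27% groups,
# applied to a hypothetical dichotomous score matrix.
import numpy as np

def item_analysis(scores: np.ndarray) -> list[dict]:
    n_examinees, n_items = scores.shape
    k = max(1, round(0.27 * n_examinees))       # size of the upper and lower groups
    order = np.argsort(scores.sum(axis=1))      # examinees ranked by total score
    low, high = scores[order[:k]], scores[order[-k:]]
    results = []
    for j in range(n_items):
        p_high = high[:, j].mean()
        p_low = low[:, j].mean()
        difficulty = (p_high + p_low) / 2       # average proportion correct of the extreme groups
        discrimination = p_high - p_low
        remark = ("Easy" if difficulty >= 0.76
                  else "Average" if difficulty >= 0.25 else "Difficult")
        results.append({"item": j + 1, "difficulty": round(difficulty, 3),
                        "remark": remark, "discrimination": round(discrimination, 3)})
    return results

rng = np.random.default_rng(2)
demo = rng.integers(0, 2, size=(40, 37))
print(item_analysis(demo)[:3])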
Interrater Reliability
The coefficient of concordance was used to determine the degree of agreement between the two raters who judged the essay type in the test. The essay type basically measures the students' knowledge of the adaptations of farmers in farming. The criteria used for rating the essay were that: (a) at least two answers are correct (1.5 pts); (b) the answer was explained (1 pt); and (c) the instruction on answering was followed (0.5 pt). The results indicate substantial agreement between the two raters: a high value of W, 0.74, was computed, indicating close concordance between the raters. This means that the two raters showed only a small variation in rating the answers in the essay. The small error variance can be attributed to the difference in the dispositions of the two raters. The first rater was the actual teacher of the subject, while the second rater was also an Araling Panlipunan teacher but one teaching at a higher level. There was a difference in how they viewed the answers even though they discussed the rating procedure at the start.

Conclusion
A low internal consistency was obtained because of the different subject contents in the test and because each subtest measures different skills. These two factors affected the internal consistency of the test. It is indeed difficult to make the test entirely uniform since the subject contents are required as minimum learning competencies by the Department of Education. Also, the listed subject contents are the planned focus of the school's subject matter budgeting for the first quarter. A correlation analysis was performed to observe the relationships among the test types. It was found that the higher the scores on the interpretation of a physical map, the higher the scores on the interpretation of the weather map; and also, the higher the scores on inference about the Philippine Islands, the higher the scores on comprehension of topics about weather. A high correlation coefficient was found between these types. The results may not be too accurate, however, since the subtests compared in the matrix do not have equal numbers of items and only the minimum number of items was subjected to the analysis. It is recommended that an equal number of items for each subtest be used to obtain a more accurate result in the correlation analysis. The agreement between the two raters for the essay type was not perfect because they had different perceptions in giving points for the answers. The item difficulty showed that most of the items are easy, since the students have gained mastery of the subject matter. The discrimination index showed that the items are distributed according to their power: there are almost equal numbers of items that are poor (27%), marginal (8%), reasonably good (22%), good (16%), and very good (24%).
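The degree of agreement between the two raters was summarized with the coefficient of concordance (W). A basic sketch of this computation is given below for two sets of hypothetical essay ratings; it uses the standard formula without the correction for tied ranks.

# Kendall's coefficient of concordance (W) for m raters and n examinees,
# without the tie correction; the two rating vectors below are hypothetical.
import numpy as np
from scipy.stats import rankdata

def kendall_w(ratings: np.ndarray) -> float:
    """ratings: m x n array (rows = raters, columns = examinees)."""
    m, n = ratings.shape
    ranks = np.vstack([rankdata(row) for row in ratings])  # rank within each rater
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

rater1 = np.array([3.0, 2.5, 1.5, 3.0, 2.0, 1.0, 2.5, 3.0])
rater2 = np.array([3.0, 2.0, 1.5, 2.5, 2.0, 1.5, 2.5, 3.0])
print(f"W = {kendall_w(np.vstack([rater1, rater2])):.2f}")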
Table 3. Item Difficulty and Discrimination Index of the Test Items

Item No.  Total  High Group  Low Group  PH     PL     Difficulty Index  Remark        Item Discrimination  Remark
1         32     11          7          1      0.636  0.818             Easy Item     0.364                Good item
2         26     7           6          0.636  0.545  0.591             Average Item  0.091                Poor item
3         34     11          7          1      0.636  0.818             Easy Item     0.364                Good item
4         38     11          10         1      0.909  0.955             Easy Item     0.909                Poor item
5         36     11          8          1      0.727  0.864             Easy Item     0.273                Reasonably Good item
6         34     11          5          1      0.455  0.727             Average Item  0.545                Very Good item
7         33     10          8          0.909  0.727  0.818             Easy Item     0.182                Marginal item
8         34     11          8          1      0.909  0.864             Easy Item     0.273                Reasonably Good item
9         39     11          10         1      0.634  0.955             Easy Item     0.091                Poor item
10        24     9           4          0.818  0.456  0.591             Average Item  0.455                Very Good item
11        23     9           5          0.818  0.273  0.636             Average Item  0.364                Good item
12        22     10          3          0.818  0.818  0.545             Average Item  0.545                Very Good item
13        36     11          9          0.909  0.727  0.864             Easy Item     0.091                Poor item
14        34     11          8          1      1      0.864             Easy Item     0.273                Marginal item
15        39     10          11         1      1      1                 Easy Item     0                    Poor item
16        28     10          5          0.909  0.455  0.682             Average Item  0.455                Very Good item
17        28     11          5          0.909  0.455  0.682             Average Item  0.455                Very Good item
18        34     10          7          1      0.636  0.818             Easy Item     0.364                Good item
19        24     11          5          0.909  0.455  0.682             Average Item  0.455                Very Good item
20        37     11          8          1      0.727  0.864             Easy Item     0.273                Reasonably Good item
21        29     11          5          1      0.455  0.727             Average Item  0.545                Very Good item
22        26     11          5          1      0.455  0.727             Average Item  0.545                Very Good item
23        33     11          7          1      0.636  0.818             Easy Item     0.364                Good item
24        37     11          8          1      0.727  0.864             Easy Item     0.273                Reasonably Good item
25        37     7           8          1      0.818  0.864             Easy Item     0.273                Reasonably Good item
26        24     11          9          1      0.364  0.909             Easy Item     0.182                Marginal item
27        37     11          4          0.636  0.818  0.5               Average Item  0.273                Reasonably Good item
28        35     11          9          1      0.636  0.909             Easy Item     0.182                Marginal item
29        39     11          7          1      0.909  0.818             Easy Item     0.364                Good item
30        40     11          10         1      1      0.955             Easy Item     0.091                Poor item
31        40     11          11         1      1      1                 Easy Item     0                    Poor item
32        40     11          11         1      1      1                 Easy Item     0                    Poor item
33        40     11          11         1      1      1                 Easy Item     0                    Poor item
34        40     11          11         1      1      1                 Easy Item     0                    Poor item
35        27     11          3          1      0.273  0.636             Easy Item     0.727                Marginal item
36        24     9           4          0.818  0.364  0.591             Average Item  0.455                Very Good item
37        36     11          7          1      0.636  0.818             Easy Item     0.364                Good item
Item Response Theory: Obtaining Item difficulty Using the Rasch Model

IRT is an approach to testing based on item analysis that considers the chance of getting particular items right or wrong. In IRT, each item on a test has its own item characteristic curve that describes the probability of getting that particular item right or wrong given the ability of the test takers (Kaplan & Saccuzzo, 1997). This will be illustrated in the computational procedure in a later section.
When the Rasch model is used as an approach for determining item difficulty, the calibration of test item difficulty is independent of the persons used for the calibration, unlike in the classical test theory approach where it is dependent on the group. In this method of test calibration, it does not matter whose responses to the items are used for comparison; it gives the same results regardless of who takes the test. The scores persons obtain on the test can be used to remove the influence of their abilities from the estimation of item difficulty. Thus, the result is a sample-free item calibration.
Rasch (1960), the proponent who derived the technique, intended to eliminate references to populations of examinees in analyses of tests, unlike in classical test theory where norms are used to interpret test scores. According to him, test analysis would only be worthwhile if it were individual centered, with separate parameters for the items and the examinees (van der Linden & Hambleton, 2004).
The Rasch model is a probabilistic unidimensional model which asserts that: (1) The
easier the question the more likely the student will respond correctly to it, and (2) the more able
the student, the more likely he/she will pass the question compared to a less able student. When
the data fit the Rasch model, the relative difficulties of the questions are independent of the
relative abilities of the students, and vice versa (Rasch, 1977).
As shown in the graph below (Figure 1), a function of ability (θ) which is a latent trait
forms the boundary between the probability areas of answering an item incorrectly and
answering the item correctly.
Figure 1
Item Characteristic Curves of an 18-item Mathematical Problem Solving Test
[Figure: item characteristic curves for the 18 items; curves located toward the left of the difficulty scale are labeled easy items and curves toward the right are labeled difficult items.]
In the item characteristic curve, the y-axis represents the probability of a correct response given ability (θ) and the x-axis represents the measure expressed in log units (logits). It can be noticed that items 1, 7, 14, 2, 8, and 15 do not require high ability to be answered correctly, compared to items 5, 12, 18, and 11, which require high ability. The item characteristic curves are judged at the 50% probability level against a cutoff of 0 on the difficulty scale: curves that reach the 50% level to the left of 0 are easy items, and those that reach it to the right are difficult items. The program WINSTEPS was used to produce the curves.
The IRT Rasch model basically identifies the location of a person's ability within a set of items for a given test. The test items have a predefined set of difficulties, and the person's position should reflect that his or her ability is matched with the difficulty of the items. The ability of the person is symbolized by θ and the items by δ. In the figure below, there are 10 items (δ1 to δ10), and the location of the person's ability (θ) is between δ7 and δ8. In the continuum, the items are arranged from the easiest (at the left) to the most difficult (at the right). If the position of the person's ability is between δ7 and δ8, then it is expected that the person taking the test should be able to answer items δ1 to δ6 ("1" correct response, "0" incorrect response), since these items are answerable given the person's level of ability. This kind of calibration is said to fit the Rasch model, where the position of the person's ability falls within a defined line of item difficulties.

Case 1
[Diagram: items δ1 to δ10 arranged along the continuum from easiest (left) to most difficult (right), with the person's ability (θ) located between δ7 and δ8.]
In Case 2, the person is able to answer four difficult items but is unable to respond correctly to the easy items. There is now difficulty in locating the person on the continuum. If the items are valid measures of ability, then the easy items should be more answerable than the difficult ones. This means that the items are not suited to the person's ability. This case does not fit the Rasch model.

Case 2
[Diagram: the person answers four of the difficult items correctly but misses the easier items, so the ability (θ) cannot be located consistently along the continuum of item difficulties.]
The Rasch model allows person ability (θ) to be estimated from the person's score on the test and item difficulty (δ) to be estimated from the number of correct responses to the item, separately from each other; that is why it is considered to be test free and sample free.
In different cases, it can happen that the person's ability (θ) is higher than the specified item difficulty (δ), so their difference (θ–δ) is greater than zero. But when the ability (θ) is less than the specified item difficulty (δ), their difference (θ–δ) is less than 0, as in Case 2. When the ability of the person (θ) is equivalent to the item's difficulty (δ), the difference (θ–δ) is 0, as in Case 1. This variation in person responses and item difficulty is represented in an Item Characteristic Curve (ICC), which shows the way the item elicits responses from persons of every ability (Wright & Stone, 1979).

Figure 2
ICC of a Given Ability and Item Difficulty
[Figure: item characteristic curve showing the probability of a correct response as a function of the difference between ability (θ) and item difficulty (δ).]
An estimate of response x is obtained when a person with ability (θ) acts on an item with difficulty (δ). It can be specified in the model, for the interaction between ability (θ) and item difficulty (δ), that when ability is greater than the difficulty, the probability of getting the correct answer is more than .5 or 50%. When the ability is less than the difficulty, the probability of getting the correct answer is less than .5 or 50%. The variation of these estimates of the probability of getting a correct response is illustrated in Figure 2. The mathematical units for θ and δ are defined in logistic functions (ln) to produce a linear scale and generality of measure.
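The relationship described above can be written as a response probability that depends only on the difference between ability and item difficulty. The sketch below is a minimal illustration of this function; it is not the WINSTEPS estimation routine, only the probability implied by the model.

# Minimal sketch of the Rasch response probability: P(correct) depends only on
# the difference between ability (theta) and item difficulty (delta), in logits.
import math

def rasch_probability(theta: float, delta: float) -> float:
    """Probability of a correct response under the one-parameter Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

# When ability equals difficulty the probability is .50; it rises above .50 when
# ability exceeds difficulty and falls below .50 when difficulty is higher.
print(rasch_probability(0.0, 0.0))   # 0.5
print(rasch_probability(1.0, 0.0))   # ~0.73
print(rasch_probability(-1.0, 0.0))  # ~0.27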
The next section guides you in estimating the calibration of item difficulty and person
ability measure.

Procedure for the Rasch Model

The Rasch model will be used on the responses of 10 students to a 25-item problem solving test. In determining item difficulty with the Rasch model, all participants who took the test are included, unlike in classical test theory where only the upper and lower 27% are included in the analysis.
ITEM NUMBER
Examinees 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 total
9 0 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 20
10 1 1 0 0 1 1 0 0 1 1 0 0 1 0 0 0 1 1 0 1 0 1 1 0 1 13
5 1 0 0 0 0 0 1 0 0 1 0 0 1 1 0 0 1 1 1 0 0 1 0 1 1 11
3 0 0 1 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 1 0 0 1 0 0 1 10
8 1 0 1 0 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 10
1 1 0 1 0 0 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 0 1 9
6 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 9
7 0 0 1 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1 0 0 0 0 1 9
4 1 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 1 0 0 0 0 0 1 0 1 8
2 0 0 1 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1 7
Total 5 3 6 1 3 4 8 5 2 4 3 3 6 5 6 1 4 2 6 5 2 5 5 2 10

Grouped Distribution of Different Item Scores

1. Code each response for each item as "1" for a right answer and "0" for a wrong answer.
2. Arrange the scores (persons) from highest to lowest.
3. Remove items that all respondents got correct.
4. Remove items that all respondents got wrong.
5. Rearrange the scores (persons) from highest to lowest.
6. Group the items with the same total item score (si).
7. Indicate the frequency (fi) of items in each group of items.
8. Divide each total item score (si) by N to get the proportion correct: ρi = si / N.
9. Subtract ρi from 1 to get the proportion incorrect: 1 – ρi.
10. Divide the proportion incorrect by the proportion correct and take the natural log of the quotient using a scientific calculator to get the logit incorrect: xi = ln[(1 – ρi) / ρi].
11. Multiply the frequency (fi) by the logit incorrect (xi) to get fixi.
12. Square xi and multiply by each fi to get fixi2.
13. Compute for the value of x•:
    x• = Σfixi / Σfi
14. To get the initial item calibration (doi), subtract x• from the logit incorrect (xi): doi = xi – x•.
15. Estimate the value of U, which will be used later in the final estimates:
    U = [Σfi(xi)2 – (Σfi)(x•)2] / (Σfi – 1)
Table 1
Grouped Distribution of the 7 Different Item Scores of 10 Examinees

Item score    Item name              si   fi   ρi    1-ρi   xi      fixi    fi(xi)2   doi = xi - x•
group index
1             7                      8    1    0.8   0.2    -1.39   -1.39   1.92      -1.87
2             3, 13, 15, 19          6    4    0.6   0.4    -0.41   -1.62   0.66      -0.89
3             1, 8, 14, 20, 22, 23   5    6    0.5   0.5     0.00    0.00   0.00      -0.48
4             6, 10, 17              4    3    0.4   0.6     0.41    1.22   0.49      -0.07
5             2, 5, 11, 12           3    4    0.3   0.7     0.85    3.39   2.87       0.37
6             9, 18, 21, 24          2    4    0.2   0.8     1.39    5.55   7.69       0.91
7             4, 16                  1    2    0.1   0.9     2.20    4.39   9.66       1.72
                                          Σfi = 24                  Σfixi = 11.54   Σfi(xi)2 = 23.29

x• = Σfixi / Σfi = 11.54 / 24 = 0.48

U = [Σfi(xi)2 - (Σfi)(x•)2] / (Σfi - 1) = [23.29 - (24)(0.48)2] / (24 - 1) = 0.77
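The item-side computations in steps 1 to 15 and Table 1 can be reproduced with a short script. The sketch below starts from the item totals of the retained items in the worked example (item 25 is dropped because all examinees answered it correctly); the variable names are illustrative.

# Sketch of the item-side grouped calibration (steps 1-15), reproducing the
# quantities in Table 1 from the item totals of the 24 retained items.
import math
from collections import Counter

N = 10
item_totals = {1: 5, 2: 3, 3: 6, 4: 1, 5: 3, 6: 4, 7: 8, 8: 5, 9: 2, 10: 4,
               11: 3, 12: 3, 13: 6, 14: 5, 15: 6, 16: 1, 17: 4, 18: 2, 19: 6,
               20: 5, 21: 2, 22: 5, 23: 5, 24: 2}

groups = Counter(item_totals.values())            # si -> fi (number of items with that score)
rows = []
for si, fi in sorted(groups.items(), reverse=True):
    p = si / N                                    # proportion correct
    x = math.log((1 - p) / p)                     # logit incorrect
    rows.append((si, fi, x))

sum_f = sum(fi for _, fi, _ in rows)
x_bar = sum(fi * x for _, fi, x in rows) / sum_f
U = (sum(fi * x * x for _, fi, x in rows) - sum_f * x_bar ** 2) / (sum_f - 1)

for si, fi, x in rows:
    print(f"si={si} fi={fi} logit={x:+.2f} initial calibration={x - x_bar:+.2f}")
print(f"x_bar = {x_bar:.2f}, U = {U:.2f}")        # approximately 0.48 and 0.77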

Grouped Distribution of Observed Person Scores

16. List the possible total scores (r) out of the maximum possible score (L), the number of items retained in the test.
17. Count the number of persons obtaining each possible score = person frequency (nr).
18. Divide each possible score by the total score to get the proportion correct: ρr = r / L.
19. Obtain the proportion incorrect by subtracting the proportion correct from 1: 1 – ρr.
20. Determine the logit correct (yr) as the natural log of the quotient between the proportion correct (ρr) and the proportion incorrect (1 – ρr): yr = ln[ρr / (1 – ρr)].
21. Multiply the logit correct (yr) by each person frequency (nr) to get nryr.
22. Square the values of the logit correct (yr2) and multiply by the person frequency (nr).
23. The logit correct (yr) is the initial person measure (bro = yr).
24. Compute for the values of y• and V to be used later in the final estimates.
Table 2
Grouped Distribution of Observed Examinee Scores on the 24-Item Mathematical Problem Solving Test

Possible     Person          Proportion    Logit         nryr     nr(yr)2   Initial person
score (r)    frequency (nr)  correct (ρr)  correct (yr)                     measure (bro = yr)
7            1               0.29          -0.89         -0.89    0.79      -0.89
8            1               0.33          -0.69         -0.69    0.48      -0.69
9            3               0.38          -0.51         -1.53    0.78      -0.51
10           2               0.42          -0.34         -0.67    0.23      -0.34
11           1               0.46          -0.17         -0.17    0.03      -0.17
12           0               0.50          0.00          0.00     0.00      0.00
13           1               0.54          0.17          0.17     0.03      0.17
14           0               0.58          0.34          0.00     0.00      0.34
15           0               0.63          0.51          0.00     0.00      0.51
16           0               0.67          0.69          0.00     0.00      0.69
17           0               0.71          0.89          0.00     0.00      0.89
18           0               0.75          1.10          0.00     0.00      1.10
19           0               0.79          1.34          0.00     0.00      1.34
20           1               0.83          1.61          1.61     2.59      1.61
             Σnr = 10                                     Σnryr    Σnr(yr)2
                                                          = -2.18  = 4.92

y• = Σnryr / Σnr = -2.18 / 10 = -0.22

V = [Σnr(yr)2 - (Σnr)(y•)2] / (Σnr - 1) = [4.92 - 10(-0.22)2] / (10 - 1) = 0.49
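The person-side computations in steps 16 to 24 and Table 2 can be sketched in the same way. The script below uses the examinees' total scores from the data matrix and the 24 retained items; it reproduces y• and V up to rounding.

# Sketch of the person-side grouped measure (steps 16-24), reproducing Table 2.
import math
from collections import Counter

L = 24                                     # items retained after dropping item 25
observed_scores = [20, 13, 11, 10, 10, 9, 9, 9, 8, 7]   # the 10 examinees' totals

score_freq = Counter(observed_scores)      # r -> nr
rows = []
for r in range(min(observed_scores), max(observed_scores) + 1):
    p = r / L                              # proportion correct
    y = math.log(p / (1 - p))              # logit correct = initial person measure
    rows.append((r, score_freq.get(r, 0), y))

sum_n = sum(nr for _, nr, _ in rows)
y_bar = sum(nr * y for _, nr, y in rows) / sum_n
V = (sum(nr * y * y for _, nr, y in rows) - sum_n * y_bar ** 2) / (sum_n - 1)
print(f"y_bar = {y_bar:.2f}, V = {V:.2f}")  # approximately -0.22 and 0.49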

Final Estimates of Item difficulty

25. Compute for the expansion factor (Y):

    Y = √[(1 + V/2.89) / (1 - UV/8.35)] = √[(1 + 0.49/2.89) / (1 - (0.77)(0.49)/8.35)] = 1.11

    where V = 0.49 (from Table 2) and U = 0.77 (from Table 1).
26. Multiply the expansion factor (Y) by the initial item calibration (doi) to obtain the corrected item calibration (di = Ydoi). The item score group index, item name, and initial item calibration are taken from Table 1.
27. Compute the standard error (SE) for each item score group:

    SE(di) = Y √[N / (Si(N - Si))]

Table 3
Final Estimates of Item Difficulties from 10 Examinees

Item score      Item name              Initial item         Expansion    Corrected calibration   Sample spread      Calibration standard
group index (i)                        calibration (doi)    factor (Y)   (di = Ydoi)             item score (si)    error SE(di)
1               7                      -1.87                1.11         -2.07                   8                  0.878
2               3, 13, 15, 19          -0.89                1.11         -0.98                   6                  0.717
3               1, 8, 14, 20, 22, 23   -0.48                1.11         -0.53                   5                  0.702
4               6, 10, 17              -0.07                1.11         -0.08                   4                  0.717
5               2, 5, 11, 12            0.37                1.11          0.41                   3                  0.766
6               9, 18, 21, 24           0.91                1.11          1.01                   2                  0.878
7               4, 16                   1.72                1.11          1.91                   1                  1.170
N = 10
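Steps 25 to 27 expand the initial calibrations and attach standard errors, as summarized in Table 3. The sketch below reproduces these final item estimates from U, V, and the score groups of the worked example; small differences from the table reflect rounding in the hand computation.

# Sketch of the final item estimates (steps 25-27) from the worked example.
import math

U, V, N = 0.77, 0.49, 10
Y = math.sqrt((1 + V / 2.89) / (1 - U * V / 8.35))   # expansion factor, ~1.11

# (initial calibration doi, item score si) for the seven score groups of Table 1
groups = [(-1.87, 8), (-0.89, 6), (-0.48, 5), (-0.07, 4), (0.37, 3), (0.91, 2), (1.72, 1)]
for doi, si in groups:
    di = Y * doi                                      # corrected item calibration
    se = Y * math.sqrt(N / (si * (N - si)))           # calibration standard error
    print(f"doi={doi:+.2f}  di={di:+.2f}  SE={se:.3f}")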

Final Estimates of Person measures

28. Compute for the expansion factor (X):

    X = √[(1 + U/2.89) / (1 - UV/8.35)] = √[(1 + 0.77/2.89) / (1 - (0.77)(0.49)/8.35)] = 1.18

    where V = 0.49 (from Table 2) and U = 0.77 (from Table 1).

29. Multiply the expansion factor (X) by each initial measure (bro) to obtain the corrected measure (br = Xbro). The possible scores and initial measures are taken from Table 2.
30. Compute the standard error (SE) for each possible score:

    SE(br) = X √[L / (r(L - r))]

Possible score (r)   Initial measure (bro)   Expansion factor (X)   Corrected measure (br = Xbro)   Test-width standard error   nr
7 -0.89 1.18 -1.05 0.53 1
8 -0.69 1.18 -0.82 0.51 1
9 -0.51 1.18 -0.60 0.50 3
10 -0.34 1.18 -0.40 0.49 2
11 -0.17 1.18 -0.20 0.48 1
12 0.00 1.18 0.00 0.48 0
13 0.17 1.18 0.20 0.48 1
14 0.34 1.18 0.40 0.49 0
15 0.51 1.18 0.60 0.50 0
16 0.69 1.18 0.82 0.51 0
17 0.89 1.18 1.05 0.53 0
18 1.10 1.18 1.30 0.56 0
19 1.34 1.18 1.57 0.59 0
20 1.61 1.18 1.90 0.65 1
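Steps 28 to 30 can be sketched in the same way for the person measures. The values printed below may differ slightly from the table above because the worked example rounds the expansion factor; the formulas are the ones reconstructed in steps 28 and 30.

# Sketch of the final person measures (steps 28-30) for a few raw scores.
import math

U, V, L = 0.77, 0.49, 24
X = math.sqrt((1 + U / 2.89) / (1 - U * V / 8.35))    # person expansion factor

for r in (7, 10, 13, 20):                             # a few possible raw scores
    b0 = math.log((r / L) / (1 - r / L))              # initial measure (logit correct)
    br = X * b0                                       # corrected person measure
    se = X * math.sqrt(L / (r * (L - r)))             # test-width standard error
    print(f"r={r:2d}  b0={b0:+.2f}  br={br:+.2f}  SE={se:.2f}")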
Figure 3
Item Map for the Calibrated Item Difficulty and Person Ability
[Item-person map: the calibrated item difficulties, grouped by item score and shown with their fit z values, are listed on the left of the logit scale, and the person measures (Cases 1 to 10) are listed on the right. Item groups located below the person measures (δ < θ) do not require high ability, the point where δ = θ falls near 0 logits, and item groups located above (δ > θ) require high ability.]
Figure 3 shows the item map of the calibrated item difficulties (left side) and person abilities (right side) across their logit values. Observe that as the items become more difficult (increasing logits), the person with the highest score (high ability) is matched closely with the most difficult items. This match is termed goodness of fit in the Rasch model. A good fit indicates that difficult items require high ability to be answered correctly; more specifically, the match in the logits of person ability and item difficulty indicates a goodness of fit. In this case the goodness of fit of the item difficulties is estimated using the z value; lower and nonsignificant z values indicate a good fit between item difficulty and person ability.
EMPIRICAL REPORT

The Application of a One-Parameter IRT description. Reitman's discussion described a


Model on a Test of Mathematical Problem problem solver as a person perceiving and
Solving accepting a goal without an immediate means of
Carlo Magno reaching the goal. Henderson and Pingry (1953)
Chang Young Hai wrote that to be problem solving there must be a
goal, a blocking of that goal for the individual,
Abstract and acceptance of that goal by the individual.
The purpose of this research was to examine the validity of What is a problem for one student may not be a
a Mathematical Problem Solving Test for fourth year high
school students and to compare traditional and Rasch-
problem for another -- either because there is no
based scores in their ability. The Mathematical Problem blocking or no acceptance of the goal.
Solving test was administered to 31 fourth year high school Schoenfeld (1985) also pointed out that defining
students studying in two Chinese schools, and the data what is a problem is always relative to the
were submitted to Rasch analysis. Traditional and Rasch- individual.
based scores for a sample of fourth year high school
students were submitted to analyses of variance with group
The measure of mathematical ability
by comparing log and SE values across the test. Twenty- through problem solving is subject to fluctuations
two items demonstrated acceptable model fit. The Rasch as any other ability constructs. Due to these
model accounted for 26% of the variance in the responses fluctuations, the measure of person ability and
to the remaining items. The findings generally support the item difficulty needs to be calibrating in a
test’s validity. Finally, the results suggest to further
explores the dimensionality of problem solving as a
logistical Model. An analysis that offers this
construct. technique is using the one-parameter Rasch
Model.
Problem solving has a special
importance in the study of mathematics. A Research on Problem Solving
primary goal of mathematics teaching and
learning is to develop the ability to solve a wide Various research methodologies are
variety of complex mathematics problems. Stanic used in mathematics education research
and Kilpatrick (1988) traced the role of problem including a clinical approach that is frequently
solving in school mathematics and illustrated a used to study problem solving. Typically,
rich history of the topic. To many mathematically mathematical tasks or problem situations are
literate people, mathematics is synonymous with devised, and students are studied as they
solving problems-doing word problems, creating perform the tasks. Often they are asked to talk
patterns, interpreting figures, developing aloud while working or they are interviewed and
geometric constructions, proving theorems, etc. asked to reflect on their experience and
The rhetoric of problem solving has been especially their thinking processes. Waters
so pervasive in the mathematics education of the (1984) discusses the advantages and
1980s and 1990s that creative speakers and disadvantages of four different methods of
writers can put a twist on whatever topic or measuring strategy use involving a clinical
activity they have in mind to call it problem approach. Schoenfeld (1983) describes how a
solving. Every exercise of problem solving clinical approach may be used with pairs of
research has gone through some agony of students in an interview. He indicates that "dialog
defining mathematics problem solving. Reitman between students often serves to make
(1965) defined a problem as when you have managerial decisions overt, whereas such
been given the description of something but do decisions are rarely overt in single student
not yet have anything that satisfies that protocols."
The basis for most mathematics problem


solving research for secondary school students in Problem Solving as a Process
the past 31 years can be found in the writings of Garofola and Lester (1985) have
Polya (1973, 1962, 1965), the field of cognitive suggested that students are largely unaware of
psychology, and specifically in cognitive science. the processes involved in problem solving and
Cognitive psychologists and cognitive scientists that addressing this issue within problem solving
seek to develop or validate theories of human instruction may be important.
learning (Frederiksen, 1984) whereas
mathematics educators seek to understand how Domain Specific Knowledge. To become
their students interact with mathematics a good problem solver in mathematics, one must
(Schoenfeld, 1985; Silver, 1987). The area of develop a base of mathematics knowledge. How
cognitive science has particularly relied on effective one is in organizing that knowledge also
computer simulations of problem solving (25,50). contributes to successful problem solving.
If a computer program generates a sequence of Kantowski (1974) found that those students with
behaviors similar to the sequence for human a good knowledge base were most able to use
subjects, then that program is a model or theory the heuristics in geometry instruction. Schoenfeld
of the behavior. Newell and Simon (1972), Larkin and Herrmann (1982) found that novices
(1980), and Bobrow (1964) have provided attended to surface features of problems
simulations of mathematical problem solving. whereas experts categorized problems on the
These simulations may be used to better basis of the fundamental principles involved.
understand mathematics problem solving. Silver (1987) found that successful
Constructivist theories have received problem solvers were more likely to categorize
considerable acceptance in mathematics math problems on the basis of their underlying
education in recent years. In the constructivist similarities in mathematical structure. Wilson
perspective, the learner must be actively involved (1967) found that general heuristics had utility
in the construction of one's own knowledge rather only when preceded by task specific heuristics.
than passively receiving knowledge. The The task specific heuristics were often specific to
teacher's responsibility is to arrange situations the problem domain, such as the tactic most
and contexts within which the learner constructs students develop in working with trigonometric
appropriate knowledge (Steffe & Wood, 1990; identities to "convert all expressions to functions
von Glasersfeld, 1989). Even though the of sine and cosine and do algebraic
constructivist view of mathematics learning is simplification."
appealing and the theory has formed the basis
for many studies at the elementary level, Algorithms. An algorithm is a procedure,
research at the secondary level is lacking. applicable to a particular type of exercise, which,
However, constructivism is consistent with if followed correctly, is guaranteed to give you the
current cognitive theories of problem solving and answer to the exercise. Algorithms are important
mathematical views of problem solving involving in mathematics and our instruction must develop
exploration, pattern finding, and mathematical them but the process of carrying out an
thinking (Schoenfeld, 1988; Kaput, 1979; algorithm, even a complicated one, is not
National Council of Supervisors of Mathematics, problem solving. The process of creating an
1978) thus teachers are urged and teacher algorithm, however, and generalizing it to a
educators become familiar with constructivist specific set of applications can be problem
views and evaluate these views for restructuring solving. Thus problem solving can be
their approaches to teaching, learning, and incorporated into the curriculum by having
research dealing with problem solving. students create their own algorithms. Research
involving this approach is currently more
prevalent at the elementary level within the the role of teacher, and direct instruction to
context of constructivist theories. develop students' abilities to generate subgoals.
It is useful to develop a framework to
Heuristics. Heuristics are kinds of think about the processes involved in
information, available to students in making mathematics problem solving. Most formulations
decisions during problem solving, that are aids to of a problem solving framework in U. S.
the generation of a solution, plausible in nature textbooks attribute some relationship to Polya's
rather than prescriptive, seldom providing (1973) problem solving stages. However, it is
infallible guidance, and variable in results. important to note that Polya's "stages" were more
Somewhat synonymous terms are strategies, flexible than the "steps" often delineated in
techniques, and rules-of-thumb. For example, textbooks. These stages were described as
admonitions to "simplify an algebraic expression understanding the problem, making a plan,
by removing parentheses," to "make a table," to carrying out the plan, and looking back.
"restate the problem in your own words," or to According to Polya (1965), problem
"draw a figure to suggest the line of argument for solving was a major theme of doing mathematics
a proof" are heuristic in nature. Out of context, and "teaching students to think" was of primary
they have no particular value, but incorporated importance. "How to think" is a theme that
into situations of doing mathematics they can be underlies much of genuine inquiry and problem
quite powerful (Polya, 1973; Polya, 1962; Polya, solving in mathematics. However, care must be
1965). taken so that efforts to teach students "how to
Theories of mathematics problem solving think" in mathematics problem solving do not get
(Newell & Simon, 1972; Schoenfeld, 1985; transformed into teaching "what to think" or "what
Wilson, 1967) have placed a major focus on the to do." This is, in particular, a byproduct of an
role of heuristics. Surely it seems that providing emphasis on procedural knowledge about
explicit instruction on the development and use of problem solving as seen in the linear frameworks
heuristics should enhance problem solving of U. S. mathematics textbooks and the very
performance; yet it is not that simple. Schoenfeld limited problems/exercises included in lessons.
(1985) and Lesh (1981) have pointed out the Clearly, the linear nature of the models
limitations of such a simplistic analysis. Theories used in numerous textbooks does not promote
must be enlarged to incorporate classroom the spirit of Polya's stages and his goal of
contexts, past knowledge and experience, and teaching students to think. By their nature, all of
beliefs. What Polya (1967) describes in How to these traditional models have the following
Solve It is far more complex than any theories we defects:
have developed so far. 1. They depict problem solving as a
Mathematics instruction stressing linear process.
heuristic processes has been the focus of several 2. They present problem solving as a
studies. Kantowski (1977) used heuristic series of steps.
instruction to enhance the geometry problem 3. They imply that solving mathematics
solving performance of secondary school problems is a procedure to be memorized,
students. Wilson (1967) and Smith (1974) practiced, and habituated.
examined contrasts of general and task specific 4. They lead to an emphasis on answer
heuristics. These studies revealed that task getting.
specific heuristic instruction was more effective These linear formulations are not very
than general heuristic instruction. Jensen (1984) consistent with genuine problem solving activity.
used the heuristic of subgoal generation to They may, however, be consistent with how
enable students to form problem solving plans. experienced problem solvers present their
He used thinking aloud, peer interaction, playing solutions and answers after the problem solving
is completed. In an analogous way, McGuinness, 2003; Lai, Cella, Chang, Bode, &
mathematicians present their proofs in very Heinemann, 2003; Linacre, Heinemann, Wright,
concise terms, but the most elegant of proofs Granger, & Hamilton, 1994; Velozo, Magalhaes,
may fail to convey the dynamic inquiry that went Pan, & Leiter, 1995; Ware, Bjorner, & Kosinski,
on in constructing the proof. 2000) but has rarely been used in mathematical
Another aspect of problem solving that is problem solving assessment (Willmes, 1981,
seldom included in textbooks is problem posing, 1992). Its primary advantages include the interval
or problem formulation. Although there has been nature of the measures it provides and the
little research in this area, this activity has been theoretical independence of item difficulty and
gaining considerable attention in U. S. person ability scores from the particular samples
mathematics education in recent years. Brown used to estimate them.
and Walter (1983) have provided the major work The Rasch model, also referred to in the
on problem posing. Indeed, the examples and item response theory literature as the one-
strategies they illustrate show a powerful and parameter logistic model, estimates the
dynamic side to problem posing activities. Polya probability of a correct response to a given item
(1972) did not talk specifically about problem as a function of item difficulty and person ability.
posing, but much of the spirit and format of The primary output of Rasch analysis is a set of
problem posing is included in his illustrations of item difficulty and person ability values placed
looking back. along a single interval scale. Items with higher
A framework is needed that emphasizes difficulty scores are less likely to be answered
the dynamic and cyclic nature of genuine correctly, and items with lower scores are more
problem solving. A student may begin with a likely to elicit correct responses. By the same
problem and engage in thought and activity to token, persons with higher ability are more likely
understand it. The student attempts to make a to provide correct responses, and those with
plan and in the process may discover a need to lower ability are less likely to do so.
understand the problem better. Or when a plan Rasch analysis (a) estimates the
has been formed, the student may attempt to difficulty of dichotomous items as the natural
carry it out and be unable to do so. The next logarithm of the odds of answering each item
activity may be attempting to make a new plan, correctly (a log odds, or logit score), (b) typically
or going back to develop a new understanding of scales these estimates to mean = 0, and then (c)
the problem, or posing a new (possibly related) estimates person ability scores on the same
problem to work on. scale. In analysis of dichotomous items, item
Problem solving abilities, beliefs, difficulty and person ability are defined such that
attitudes, and performance develop in contexts when they are equal, there is a 50% chance of a
(Schoenfeld, 1988) and those contexts must be correct response. As person ability exceeds item
studied as well as specific problem solving difficulty, the chance of a correct response
activities. increases as a logistic ogive function, and as
item difficulty exceeds person ability, the chance
Rasch Analysis of success decreases. The formal relationship
Rasch analysis (Bond & Fox, 2001; among response probability, person ability, and
Rasch, 1980; Wright & Stone, 1979) offers item difficulty is given in the mathematical
potential advantages over the traditional equation by Bond and Fox (2001, p. 201). A
psychometric methods of classical test theory. It graphic plot of this relationship, known as the
has been widely applied in health status item characteristic curve (ICC), is given for three
assessment (e.g., Antonucci, Aprile, & Paulucci, items of different difficulty levels.
2002; Duncan, Bode, Lai, & Perera, 2003; One useful feature of the Rasch model is
Fortinsky, Garcia, Sheenan, Madigan, & Tullai- referred to as parameter separation or specific
objectivity (Bond & Fox, 2001; Embretson & requires that individual items do not influence
Reise, 2000). The implication of this one another (i.e., they are uncorrelated, once the
mathematical property is that, at least in theory, dimension of item difficulty-person ability is taken
item difficulty values do not depend on the into account). Thus, no considerations of item
person sample used to estimate them, nor do content, beyond their difficulty values, are
person ability scores depend on the particular necessary for estimating person ability, and
items used to estimate them. In practical terms, changing the order of item administration should
this means that given well-calibrated sets of not change item or person estimates. In
items that fit the Rasch model, robust and directly mathematical terms, this assumption states that
comparable ability estimates may be obtained the probability of a string of responses is equal to
from different subsets of items. This, in turn, the product of the individual probabilities of each
facilitates both adaptive testing and the equating of the separate responses comprising it. Failure
of scores obtained from different instruments to meet this assumption can suggest the
(Bond & Fox, 2001; Embretson & Reise, 2000). presence of another dimension in the data.
Rasch theory makes a number of explicit Local dependence is often a concern in
assumptions about the construct to be measured the construction of reading comprehension tests
and the items used to measure it, two of which that include multiple questions about the same
have already been discussed above. The first is passage, because responses to such questions
that all test items respond to the same may be determined not only by the difficulty of
unidimensional construct. One set of tools for each individual item but also by the difficulty and
examining the extent to which test items content of the passage. Responses to items of
approximate unidimensionality are the fit this type are often intercorrelated even after their
statistics provided by Rasch analysis. These fit individual difficulties have been taken into
statistics indicate the amount of variation account. To give another example, if a particular
between model expectations and observations. question occurring earlier in a test provides
They identify items and people eliciting specific information about the answer to a later
unexpected responses, such as when a person question, then these two items are also likely to
of high ability responds incorrectly to an easy demonstrate local dependence.
question, perhaps because of carelessness or A final important assumption of the
because of a poorly constructed or administered Rasch model is that the slope of the item
item. Fit statistics can be informative with respect characteristic curve, also known as the item
to dimensionality because they indicate when discrimination parameter, is equal to 1 for all
different people may be responding to different items (Bond & Fox, 2001; Embretson & Reise,
aspects of an item's content or the testing 2000; Wainer & Mislevy, 2000). This assumption
situation. is presented graphically in Figure 1, where all
A second key assumption of Rasch analysis, also three curves are parallel with a slope equal to 1.
mentioned above, is that individuals can be The consequence of this assumption is that a
placed on an ordered continuum along the given change in ability level will have the same
dimension of interest, from those having less effect on the log odds of a correct response for
ability to those having more (Bond & Fox, 2001). all items. Items that have different discrimination
Similarly, the analysis assumes that items may values, a given change in ability has different
be placed on the same scale, from those consequences for different items. When an item's
requiring less ability to those requiring more. discrimination parameter is high, a relatively
A third assumption underlying Rasch small change in ability level results in a large
analysis is that of local, or conditional, change in response probability. When
independence (Embretson & Reise, 2000; discrimination is low, larger changes in ability
Wainer & Mislevy, 2000). This assumption level are needed to change response probability.
A highly discriminating item (i.e., one with a high Problem Solving data provided by a sample of
ICC slope) is more likely to result in different fourth year high school students in two Chinese
responses from two individuals of different ability Schools. One purpose of the study was to
levels, whereas an item with a low discrimination determine whether the construct validity of the
parameter (i.e. a low ICC slope) more often test is supported by Rasch analysis. Specifically,
results in the same response from both. Rasch it is hypothesized that the test responds to a
models have been shown to be robust to small cohesive unidimensional construct. Item fit
and/or unsystematic violations of this assumption statistics, a Rasch-based unidimensionality
(Penfield, 2004; van de Vijver, 1986), but when coefficient, and principal-components analysis of
the ICC slopes in an item set differ substantially model residuals were used to evaluate this
and/or systematically from 1, the test developer hypothesis.
is advised to reconsider the extent to which the 2. To test the hypothesis that Rasch
offending items measure the relevant construct estimates of person ability, because of their
(Wright, 1991). status as interval-level measures, are more valid
An example on the use of the one- and sensitive than traditionally computed scores.
parameter Rasch Model is the study by
El-Korashy (1995) where the Rasch Model was Method
applied to the selection of items for an Arabic Participants
version of the Otis-Lennon Mental Ability Test.
Correspondence of item calibration to person The participants were 31 high school
measurement indicated that the test is suitable students from two different schools. The two
for the range of mental ability intended to be high schools are UNO High School and Grace
measured. Another is the study by Lamprianou Christian High School. These two high schools
(2004) that analyzes data from three testing were chosen for their popularity in molding high
cycles of the National Curriculum tests in achievers in Mathematics. The participants were
mathematics in England using the Rasch model. fourth year high school students, both male and
It was found that pupils having English as an female students and belonging to the 16-18 age
additional language and pupils belonging to group. The decision to choose high school
ethnic minorities are significantly more likely to students was made because the high school
generate aberrant response patterns. However, educational system was much more regimented,
within the groups of pupils belonging to ethnic and it can be safely assumed that any given
minorities, those who speak English as an fourth year student would have studied the
additional language are not significantly more lessons required of a third year student.
likely to generate misfitting response patterns. Convenient sampling was used to select the
This may indicate that the ethnic background respondents.
effect is more significant than the effect of the
first language spoken. The results suggest that Instrument
pupils having English as an additional language
and pupils belonging to ethnic minorities are Mathematical Problem Solving Test. The
mismeasured significantly more than the Mathematical Problem Solving test was
remainder of pupils by taking the mathematics constructed to measure the problem solving
National Curriculum tests. More research is ability of the students (see Appendix A). There
needed to generalize the results to other subjects are 25 items included in the test that covers third
and contexts. year high school lessons. Third year lessons
Purpose of the Study were used because the participants will only be
1. In the current investigation, the Rasch starting their fourth year in high school, and might
model was used to analyze a set of Mathematical not have enough knowledge of fourth year math.
The coverage of the test includes fractions, Procedure


factoring, simple algebraic equations and various
word problems. These factors are based on the The Mathematical Problem Solving Test
Merle S. Alferez (MSA) Review Questions for All was administered to fourth Year High school
College Entrance Test (ACET) and University of students of two Chinese Schools in Manila. A
the Philippines College Admissions Test letter requesting to administer the test was sent
(UPCAT), and the College Entrance Test to the Math teacher. The mathematics teacher
Reviewer third edition. was given detailed instructions on how to
A professor from the Mathematics administer the test. A copy of the instructions to
Department of De La Salle University-Manila was be given to the students were provided so that
asked to critique the items in the Mathematical the administration would be constant across
Problem Solving Test. The item reviewer was situations. After administering the test the
given a copy of the Table of Specifications. This students and teachers were debriefed about the
table served to orient about the nature of the purpose of the study.
items used in the test. The proponent then
explained the purpose of the test in order to Data Analysis
revise the items to better fulfill the objectives of
the exam. After the mathematical problem To describe the distribution of the scores,
solving test was revised, it was pre-tested on 10 the mean, standard deviation, kurtosis, and
high school students from Saint Jude Catholic skewness were obtained. The reliability of the
School to determine the length of time needed by items were evaluated using the Kuder
students in answering the entire test. Richardson #20.
The Mathematical Problem Solving Test Item Analysis was conducted using both
was then given to 31 fourth year high school Classical Test Theory (CTT) and Item Response
students for pilot testing. The data from the pilot Theory (IRT). In the CTT the item difficulty and
testing were used for reliability and item analysis. item discrimination were determined using the
The Kuder-Richardson reliability was used to proportion of the high group and the low group.
determine the internal consistency of the items of Item difficulty is determined by getting the
the Mathematical Problem Solving Test. This average proportion of correct responses between
method was used to be able to find the the high group and low group. The Item
consistency of the responses on all the items in discrimination is determined by computing for the
the test. The test has an internal consistency of difference between the high group and the low
.84 based on the KR #20. The skewness of the group. The estimation of Rasch item difficulty and
distribution of scores is somehow negatively person ability scores and related analyses were
skewed with a value of -.158. The distribution of carried out using WINSTEPS. This software
scores has a kurtosis of -1.05. The overall mean package begins with provisional central
of the test performance of the participants in the estimates of item difficulty and person ability
pilot test is 16.23 with a standard deviation of parameters, compares expected responses
standard deviation of 5.45. This shows that based on these estimates to the data, constructs
scores of 17 to 25 are high in problem solving new parameter estimates using maximum
and a score of 15 and below are below average. likelihood estimation, and then reiterates the
A standard deviation of 5.45 means that the analysis until the change between successive
individual scores are dispersed. iterations is small enough to satisfy a preselected
criterion value. The item parameter estimates are
typically scaled to have M = 0, and person ability
scores are estimated in reference to the item
mean. A unit on this scale, a logit, represents the
change in ability or difficulty necessary to change answered with low ability while items 3, 7 and 2
the odds of a correct response by a factor of requires higher ability to get a correct response.
2.718, the base of the natural logarithm. Persons The characteristic curve shows that
who respond to all items correctly or incorrectly, Items 11, 13, 2 and 8 have the probability of
and items to which all persons respond correctly being answered with low ability while items 9, 10
or incorrectly, are uninformative with respect to and 4 requires higher ability to get a correct
item difficulty estimation and are thus excluded response. The overlap between items 11 and 13
from the parameter estimation process. and items 9 and 10 means that the same ability
are required to get the probability of answering
the item correct.
Results The characteristic curve shows that
Items 16 and 19 have the probability of being
Item analysis was used to evaluate answered with low ability while items 17, 18, 20,
whether the items in the Mathematical Problem 21 and 22 requires higher ability to get a correct
Solving Test are easy, average or difficult. The response. The overlap between items 18, 20, 21,
difficulty of an item is based on the percentage of and 22, and items 16 and 19 means that the
people who answered it correctly. The index same ability are required to get the probability of
discrimination revealed that there are no answering the item correct. Items 23, 24 and 25
marginal items as well as bad items; however, are excluded because of extreme responses.
84% of the items are very good, 2% are good
items and 2% are reasonably good items. Examination of Fit
In the item difficulty, the each item
indicates whether it is easy, average or difficult. The average INFIT statistics is 1.00 and
Item difficulty is determined if the items have the average OUTFIT statistics is .98 which indicates
appropriate difficulty level. It was found out that that the data for the items are showing goodness
there are no difficult items presented, although of fit because the value is less than 1.5 except for
72% of the items are average and 28% are easy. items 23, 24 and 25.
Unidimensionality Coefficient
One Parameter-Rasch Model To address the question of construct
When the test scores and ability of the dimensionality, a Rasch unidimensionality
students in the Mathematical Problem Solving coefficient was calculated. This coefficient was
Test was calibrated new indices for the reliability calculated as the ratio of the person separation
was obtained. The student reliability was .50 with reliability estimated using model standard errors
a RMSE of .52 and the Math reliability is .34 with (which treat model misfit as random variation) to
an RMSE of .82. The errors associated with the person separation reliability estimated using
these estimates are high indicating that the data real standard errors (which regard misfit as true
does not fit well the expected ability and test departure from the unidimensional model; Wright,
difficulty. Figure 1 shows the test characteristic 1994). The closer the value of the coefficient to
curve generated by the WINSTEPS. 1.0, the more closely the data approximate
In the computed separation for ability is unidimensionality. The unidimensionality
1.20 and the item (expected score) is 11 which is coefficient for the current data set was .61 (ratio
.73 when converted into a standardized estimate. of 1.20 and .73 separation values) which is quite
Although these extreme values are adjusted by marginal to 1.00. This means that the data might
fine tuning the slopes produced for each item. form dimensions.
The characteristic curve shows that Principal Components analysis shows
Items 5, 1, 6 and 4 have the probability of being that there can possibly be 7 factors that can be
formed with the items excluding item 25 with no because of the additional lexical load imposed by
variation as indicated in the scree plot. the inclusion of size adjectives.
Principal-components analysis of model Aspects of the tests validity were
residuals conducted for the 24-item pool (after supported by the present analyses. First, two
exclusion of the seven misfitting items) revealed items were only excluded because of poor model
that 26.97% of the variance in the observations fit. Perhaps participants were not generally able
was accounted for by the Rasch dimension of to figure out the proper response strategy by the
item difficulty-person ability. The next largest end of the test (because of the provision of
factor extracted accounted for only 4.86% of the repeats and cues) and were then able to
remaining variance. effectively implement problem solving strategies.
The log functions for each item shows If this is correct, then eliminating these items
large standard errors. This supports the principal should introduce misfit for the items of this type.
components analysis that there might be factors The two other items that were excluded
formed out of the 22 items. because of poor model fit were the last test item,
which differs from the earlier items in that it
Discussion contains two-part commands and requires
responses using more skills. This suggests that
The present results generally support the initial responses to different kinds of commands
construct and content validity of the Mathematical might be determined in part by another construct,
problem solving test. First, the acceptable fit of for example, ability to switch set.
the 22 test items to the Rasch model and the A second aspect of the test’s validity that
marginal unidimensionality coefficient (.61) the present analysis failed to confirm concerns
support the hypothesis that the RTT measures a the homogeneity of item difficulty within subtests.
unidimensional construct. Furthermore, The differences between the parameter
acceptable item and person separation indices estimates within the items suggest that they are
and reliability coefficients suggest that the not necessarily homogeneous with respect to
parameter estimates obtained in the current difficulty. The present finding might have been in
study are both reproducible and useful for part the result of a relatively small and poorly
differentiating items and persons from one targeted sample. A larger sample with a broader
another. distribution might obtain less item variability.
In addition, principal-components Although sample sizes of approximately 100
analysis of Rasch model residuals (with the two have been argued to produce stable item
misfitting items excluded) indicated that the parameter estimates (Linacre, 1994; van de
dimension of person ability-item difficulty Vijver, 1986), larger samples are preferable.
accounted for the majority of the variance in the Willmes's (1981) prior finding suggests that the
data (26.97%) and the next largest factor present result may be reliable, but his participant
extracted accounted for very little additional sample was similarly sized, if perhaps better
variance (4.86%). Although this does not provide targeted.
further support for the unidimensionality of the
test. References
The pattern of item difficulty across Andrich, D. (2004). Controversy and the Rasch model: A
subtests was consistent with item content and characteristic of incompatible paradigms? Medical
Care, 41, 17-116.
similar for values derived by Rasch analysis and Antonucci, G., Aprile, T., & Paulucci, S. (2002). Rasch
traditional methods. As expected, based on analysis of the Rivermead Mobility Index: A study
increasing lexical load, the results showed using mobility measures of first-stroke inpatients.
variation in the difficulty. There are more items Archives of Physical Medicine and Rehabilitation, 83,
that can be answered requiring low ability 1442-1449.
115

Arvedson, J. C., McNeil, M. R., & West, T. L. (1986). Prediction of Revised Token Test overall, subtest, and linguistic unit scores by two shortened versions. Clinical Aphasiology, 16, 57-63.
Blackwell, A., & Bates, E. (1995). Inducing agrammatic profiles in normals: Evidence for the selective vulnerability of morphology under cognitive resource limitation. Journal of Cognitive Neuroscience, 7, 228-257.
Bobrow, D. G. (1964). Natural language input for a computer problem solving system. Unpublished doctoral dissertation, Massachusetts Institute of Technology, Boston.
Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Erlbaum.
Briggs, D. C., & Wilson, M. (2003). An introduction to multidimensional measurement using Rasch models. Journal of Applied Measurement, 4, 87-100.
Brown, S. I. & Walter, M. I. (1983). The art of problem posing. Hillsdale, NJ: Lawrence Erlbaum.
Chang, W-C., & Chan, C. (1995). Rasch analysis for outcomes measures: Some methodological considerations. Archives of Physical Medicine and Rehabilitation, 76, 934-939.
Cliff, N. (1992). Abstract measurement theory and the revolution that never happened. Psychological Science, 3, 186-190.
DiSimoni, F. G., Keith, R. L., & Darley, F. L. (1980). Prediction of PICA overall score by short versions of the test. Journal of Speech and Hearing Research, 23, 511-516.
Duffy, J. R., & Dale, B. J. (1977). The PICA scoring scale: Do its statistical shortcomings cause clinical problems? In R. H. Brookshire (Ed.), Collected proceedings from clinical aphasiology (pp. 290-296). Minneapolis, MN: BRK.
Duncan, P. W., Bode, R., Lai, S. M., & Perera, S. (2003). Rasch analysis of a new stroke-specific outcome scale: The Stroke Impact Scale. Archives of Physical Medicine and Rehabilitation, 84, 950-963.
Efron, B., & Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1, 54-77.
El-Korashy, A. (1995). Applying the Rasch model to the selection of items for a mental ability test. Educational and Psychological Measurement, 55, 753.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Fischer, G. H., & Molenaar, I. W. (1995). Rasch models: Foundations, recent developments and applications. New York: Springer.
Fortinsky, R. H., Garcia, R. I., Sheenan, T. J., Madigan, E. A., & Tullai McGuinness, S. (2003). Measuring disability in Medicare home care patients: Application of Rasch modeling to the Outcome and Assessment Information Set. Medical Care, 41, 601-615.
Frederiksen, N. (1984). Implications of cognitive theory for instruction in problem solving. Review of Educational Research, 54, 363-407.
Freed, D. B., Marshall, R. C., & Chulantseff, E. A. (1996). Picture naming variability: A methodological consideration of inconsistent naming responses in fluent and nonfluent aphasia. In R. H. Brookshire (Ed.), Clinical aphasiology conference (pp. 193-205). Austin, TX: Pro-Ed.
Garfola, J. & Lester, F. K. (1985). Metacognition, cognitive monitoring, and mathematical performance. Journal for Research in Mathematics Education, 16, 163-176.
Guilford, J. P. (1954). Psychometric methods. New York: McGraw-Hill.
Henderson, K. B. & Pingry, R. E. (1953). Problem solving in mathematics. In H. F. Fehr (Ed.), The learning of mathematics: Its theory and practice (21st Yearbook of the National Council of Teachers of Mathematics) (pp. 228-270). Washington, DC: National Council of Teachers of Mathematics.
Hobart, J. C. (2002). Measuring disease impact in disabling neurological conditions: Are patients' perspectives and scientific rigor compatible? Current Opinions in Neurology, 15, 721-724.
Howard, D., Patterson, K., Franklin, S., Morton, J., & Orchard-Lisle, V. (1984). Variability and consistency in naming by aphasic patients. Advances in Neurology, 42, 263-276.
Jensen, R. (1984). A multifaceted instructional approach for developing subgoal generation skills. Unpublished doctoral dissertation, The University of Georgia.
Kahneman, D. (1973). Attention and effort. Englewood Cliffs, NJ: Prentice-Hall.
Kantowski, M. G. (1974). Processes involved in mathematical problem solving. Unpublished doctoral dissertation, The University of Georgia, Athens.
Kantowski, M. G. (1977). Processes involved in mathematical problem solving. Journal for Research in Mathematics Education, 8, 163-180.
Kaput, J. J. (1979). Mathematics learning: Roots of epistemological status. In J. Lochhead and J. Clement (Eds.), Cognitive process instruction. Philadelphia, PA: Franklin Institute Press.
Lai, J-S., Cella, D., Chang, C. H., Bode, R., & Heinemann, A. W. (2003). Item banking to improve, shorten and computerize self-reported fatigue: An illustration of steps to create a core item bank from the FACIT-Fatigue Scale. Quality of Life Research, 12, 485-501.
Lamprianou, I. & Boyle, B. (2004). Accuracy of measurement in the context of mathematics national curriculum tests in England for ethnic minority pupils and pupils who speak English as an additional language. JEM, 41, 239-251.
Larkin, J. (1980). Teaching problem solving in physics: The psychological laboratory and the practical classroom. In F. Reif & D. Tuma (Eds.), Problem solving in education: Issues in teaching and research. Hillsdale, NJ: Lawrence Erlbaum.
Lesh, R. (1981). Applied mathematical problem solving. Educational Studies in Mathematics, 12(2), 235-265.
Linacre, J. M. (1994). Sample size and item calibration stability. Rasch Measurement Transactions, 7, 328.
Linacre, J. M. (1998). Structure in Rasch residuals: Why principal components analysis? Rasch Measurement Transactions, 12, 636.
Linacre, J. M. (2002). Facets, factors, elements and levels. Rasch Measurement Transactions, 16, 880.
Linacre, J. M., & Wright, B. D. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.
Linacre, J. M., & Wright, B. D. (2003). WINSTEPS: Multiple-choice, rating scale, and partial credit Rasch analysis [Computer software]. Chicago: MESA Press.
Linacre, J. M., Heinemann, A. W., Wright, B., Granger, C. V., & Hamilton, B. B. (1994). The structure and stability of the Functional Independence Measure. Archives of Physical Medicine and Rehabilitation, 75, 127-132.
Lord, F. M., Novick, M. R., & Birnbaum, A. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Luce, R. D., & Tukey, J. W. (1964). Simultaneous conjoint measurement: A new type of fundamental measurement. Journal of Mathematical Psychology, 1, 1-27.
Lumsden, J. (1978). Tests are perfectly reliable. British Journal of Mathematical and Statistical Psychology, 31, 19-26.
Masters, G. (1993). Undesirable item discrimination. Rasch Measurement Transactions, 7, 289.
McHorney, C. A., Haley, S. M., & Ware, J. E. (1997). Evaluation of the MOS SF-36 physical functioning scale (PF-10): II. Comparison of relative precision using Likert and Rasch scoring methods. Journal of Clinical Epidemiology, 50, 451-461.
McNeil, M. R. (1988). Aphasia in the adult. In N. J. Lass, L. V. McReynolds, J. Northern, & D. E. Yoder (Eds.), Handbook of speech-language pathology and audiology (pp. 738-786). Toronto, Ontario, Canada: D. C. Becker.
McNeil, M. R., & Hageman, C. F. (1979). Auditory processing deficits in aphasia evidenced on the Revised Token Test: Incidence and prediction of across subtest and across item within subtest patterns. In R. H. Brookshire (Ed.), Clinical aphasiology conference proceedings (pp. 47-69). Minneapolis, MN: BRK.
Merbitz, C., Morris, J., & Grip, J. C. (1989). Ordinal scales and foundations of misinference. Archives of Physical Medicine and Rehabilitation, 70, 308-312.
Michell, J. (1990). An introduction to the logic of psychological measurement. Hillsdale, NJ: Erlbaum.
Michell, J. (1997). Quantitative science and the definition of measurement in psychology. British Journal of Psychology, 88, 355-383.
Michell, J. (2004). Item response models, pathological science, and the shape of error. Theory and Psychology, 14, 121-129.
National Council of Supervisors of Mathematics. (1978). Position paper on basic mathematical skills. Mathematics Teacher, 71(2), 147-52. (Reprinted from position paper distributed to members January 1977.)
Newell, A. & Simon, H. A. (1972). Human problem solving. Englewood Cliffs, NJ: Prentice Hall.
Norquist, J. M., Fitzpatrick, R., Dawson, J., & Jenkinson, C. (2004). Comparing alternative Rasch-based methods vs. raw scores in measuring change in health. Medical Care, 42, 125-136.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Orgass, B. (1976). Eine Revision des Token Tests, Teil I und II [A revision of the token tests, Part I and II]. Diagnostica, 22, 70-87.
Penfield, R. D. (2004). The impact of model misfit on partial credit model parameter estimates. Journal of Applied Measurement, 5, 115-128.
Polya, G. (1962). Mathematical discovery: On understanding, learning and teaching problem solving (vol. 1). New York: Wiley.
Polya, G. (1965). Mathematical discovery: On understanding, learning and teaching problem solving (vol. 2). New York: Wiley.
Polya, G. (1973). How to solve it. Princeton, NJ: Princeton University Press. (Originally copyrighted in 1945).
Porch, B. (2001). Porch Index of Communicative Ability. Albuquerque, NM: PICA Programs.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press. (Original work published 1960)
Reitman, W. R. (1965). Cognition and thought. New York: Wiley.
Schoenfeld, A. H. (1983). Episodes and executive decisions in mathematics problem solving. In R. Lesh & M. Landau (Eds.), Acquisition of mathematics concepts and processes. New York: Academic Press.
Schoenfeld, A. H. (1985). Mathematical problem solving. Orlando, FL: Academic Press.
Schoenfeld, A. H. (1988). When good teaching leads to bad results: The disasters of "well taught" mathematics classes. Educational Psychologist, 23, 145-166.
Schoenfeld, A. H., & Herrmann, D. (1982). Problem perception and knowledge structure in expert and novice mathematical problem solvers. Journal of Experimental Psychology: Learning, Memory and Cognition, 8, 484-494.
Segall, D. O. (1996). Multidimensional adaptive testing. Psychometrika, 61, 331-354.
Silver, E. A. (1987). Foundations of cognitive theory and research for mathematics problem-solving instruction. In A. H. Schoenfeld (Ed.), Cognitive science and mathematics education (pp. 33-60). Hillsdale, NJ: Lawrence Erlbaum.
Smith, J. P. (1974). The effects of general versus specific heuristics in mathematical problem-solving tasks (Columbia University, 1973). Dissertation Abstracts International, 34, 2400A.
Smith, R. M. (1986). Person fit in the Rasch model. Educational and Psychological Measurement, 46, 359-372.
Stanic, G., & Kilpatrick, J. (1988). Historical perspectives on problem solving in the mathematics curriculum. In R. I. Charles & E. A. Silver (Eds.), The teaching and assessing of mathematical problem solving (pp. 1-22). Reston, VA: National Council of Teachers of Mathematics.
Steffe, L. P., & Wood, T. (Eds.). (1990). Transforming children's mathematical education. Hillsdale, NJ: Lawrence Erlbaum.
Stevens, S. S. (1946, June 7). On the theory of scales of measurement. Science, 103, 677-680.
van de Vijver, F. J. R. (1986). The robustness of Rasch estimates. Applied Psychological Measurement, 10, 45-57.
Velozo, C. A., Magalhaes, L. C., Pan, A.-W., & Leiter, P. (1995). Functional scale discrimination at admission and discharge: Rasch analysis of the Level of Rehabilitation Scale-III. Archives of Physical Medicine and Rehabilitation, 76, 705-712.
von Glasersfeld, E. (1989). Constructivism in education. In T. Husen & T. N. Postlethwaite (Eds.), The international encyclopedia of education (Suppl. Vol. I, pp. 162-163). New York: Pergammon.
Wainer, H., & Mislevy, R. J. (2000). Item response theory, item calibration, and proficiency estimation. In H. Wainer, N. J. Dorans, D. Eignor, R. Flaugher, B. F. Green, & R. J. Mislevy, et al. (Eds.), Computerized adaptive testing: A primer (2nd ed., pp. 61-100). Mahwah, NJ: Erlbaum.
Wainer, H., Dorans, N. J., Eignor, D., Flaugher, R., Green, B. F., Mislevy, R. J., et al. (2000). Computerized adaptive testing: A primer (2nd ed.). Mahwah, NJ: Erlbaum.
Ware, J. E., Bjorner, J. B., & Kosinski, M. (2000). Practical implications of item response theory and computerized adaptive testing: A brief summary of ongoing studies of widely used headache impact scales. Medical Care, 38, II73-II82.
Waters, W. (1984). Concept acquisition tasks. In G. A. Goldin & C. E. McClintock (Eds.), Task variables in mathematical problem solving (pp. 277-296). Philadelphia, PA: Franklin Institute Press.
Willmes, K. (1981). A new look at the Token Test using probabilistic test models. Neuropsychologia, 19, 631-645.
Willmes, K. (1992). Psychometric evaluation of neuropsychological test performances. In N. von Steinbuechel, D. Y. Cramon, & E. Poeppel (Eds.), Neuropsychological rehabilitation (pp. 103-113). Heidelberg, Germany: Springer-Verlag.
Willmes, K. (2003). Psychometric issues in aphasia therapy research. In I. Papathanasiou & R. De Bleser (Eds.), The sciences of aphasia: From theory to therapy (pp. 227-244). Amsterdam: Pergamon.
Wilson, J. W. (1967). Generality of heuristics as an instructional variable. Unpublished doctoral dissertation, Stanford University, San Jose, CA.
Wright, B. D. (1991). IRT in the 1900's: Which models work best? Rasch Measurement Transactions, 6, 196-200.
Wright, B. D. (1994). A Rasch unidimensionality coefficient. Rasch Measurement Transactions, 8, 385.
Wright, B. D. (1996). Local dependency, correlations and principal components. Rasch Measurement Transactions, 10, 509-511.
Wright, B. D. (1999). Fundamental measurement for psychology. In S. E. Embretson & S. L. Hershberger (Eds.), The new rules of measurement: What every psychologist and educator should know (pp. 65-104). Mahwah, NJ: Erlbaum.
Wright, B. D., & Linacre, J. M. (1989). Observations are always ordinal; measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70, 857-860.
Wright, B. D., & Masters, G. S. (1982). Rating scale analysis. Chicago: MESA Press.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Wright, B., & Masters, G. (1997). The partial credit model. In W. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 101-121). New York: Springer.
SPECIAL TOPIC

A Review of Psychometric Theory


Carlo Magno

This special topic presents the nature of psychometrics, including issues in psychological measurement, its relevant theories, and its current practice. The basic scaling models are discussed, since scaling is the process that enables the quantification of psychological constructs. The issues and research trends in classical test theory and item response theory, with their different models and their implications for test construction, are also explained.

The Nature of Psychometrics and Issues in Measurement

Psychometrics concerns itself with the science of measuring psychological constructs such as ability, personality, affect, and skills. Psychological measurement methods are crucially important for basic research in psychology, because research in psychology involves the measurement of variables in order to conduct further analysis. In the past, obtaining adequate measurement of psychological constructs was considered an issue in the science of psychology. Some references indicate that there are psychological constructs that are deemed unobservable and difficult to quantify. This issue is compounded by the fact that psychological theories are filled with variables that either cannot be measured at all at the present time or can be measured only approximately (Kaplan & Saccuzzo, 1997), such as anxiety, creativity, dogmatism, achievement, motivation, attention, and frustration. Moreover, according to Immanuel Kant, "it is impossible to have a science of psychology because the basic data could not be observed and measured." Nevertheless, the field of psychological measurement has advanced, and practitioners in the field of psychometrics have been able to deal properly with these issues and devise methods on the basic premise of scientific observation and measurement. Most psychological constructs involve subjective experiences such as feelings, sensations, and desires; when individuals make judgments, state their preferences, and even talk about these experiences, measurement can take place and thus meets the requirements of scientific inquiry. It is very much possible to assign numbers to psychological constructs to represent quantities of attributes and even to formulate rules for standardizing the measurement process.
Standardizing psychological measurement requires a process of abstraction where psychological attributes are observed in relation to other constructs such as attitude and achievement (Magno, 2003). This process makes it possible to establish associations among variables through construct validation and criterion-prediction procedures. Also, emphasizing the measurement of psychological constructs forces researchers and test developers to consider carefully the nature of the construct before attempting to measure it. This involves a thorough literature review on the conceptual definition of an attribute before constructing valid items for a test. It is also common practice in psychometrics to use numerical scores to communicate the amount of an attribute an individual possesses. Quantification is deeply intertwined with the concept of measurement. In the process of quantification, mathematical systems and statistical procedures are used to examine the internal relationships among data obtained through a measure. Such procedures enable psychometrics to build theories, placing itself within the system of science.
Branches of Psychometric Theory

There are two branches of psychometric theory: classical test theory and item response theory. Both theories make it possible to predict outcomes of psychological tests by identifying parameters of item difficulty and the ability of test takers, and both are concerned with improving the reliability of psychological tests.

Classical Test Theory

Classical test theory is often referred to in the literature as "true score theory." The theory starts from the assumption that systematic differences between the responses of examinees are due only to variation in the ability of interest. All other potential sources of variation existing in the testing situation, such as external conditions or internal conditions of examinees, are assumed either to be held constant through rigorous standardization or to have an effect that is nonsystematic or random in nature (van der Linden & Hambleton, 2004). The central model of classical test theory is that an observed test score (TO) is composed of a true score (T) and an error score (E), where the true and error scores are independent. These variables were established by Spearman (1904) and Novick (1966) and are best illustrated in the formula:

TO = T + E

The classical theory assumes that each individual has a true score that would be obtained if there
were no errors in measurement. However, because measuring instruments are imperfect, the score
observed for each person may differ from an individual’s true ability. The difference between the true score
and the observed test score results from measurement error. Using a variety of justifications, error is often assumed to be a random variable having a normal distribution. The implication of classical test theory for test takers is that tests are fallible, imprecise tools. The score achieved by an individual is rarely the individual's true score; the true score for an individual, however, will not change with repeated applications of the same test. The observed score is almost always the true score influenced by some degree of error, and this error pushes the observed score higher or lower. Theoretically, the standard deviation of the distribution of random errors for each individual indicates the magnitude of measurement error. It is usually assumed that the distribution of random errors will be the same for all individuals.
Classical test theory uses the standard deviation of errors as the basic measure of error. Usually this is
called the standard error of measurement. In practice, the standard deviation of the observed score and the
reliability of the test are used to estimate the standard error of measurement (Kaplan & Saccuzzo, 1997).
The larger the standard error of measurement, the less certain is the accuracy with which an attribute is
measured. Conversely, small standard error of measurement tells that an individual score is probably close
to the true score. The standard error of measurement is calculated with the formula:
Sm = S√(1 − r)

Standard errors of measurement are used to create confidence intervals around specific observed scores
(Kaplan & Saccuzzo, 1997). The lower and upper bound of the confidence interval approximates the value
of the true score.
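To make this concrete, the computation can be sketched in a few lines of Python. The reliability, standard deviation, and observed score below are invented values used purely for illustration and do not come from any test discussed here.

    import math

    # Hypothetical values chosen only for illustration.
    sd_observed = 10.0    # standard deviation of observed scores (S)
    reliability = 0.84    # reliability coefficient of the test (r)
    observed_score = 75   # one examinee's observed score

    # Standard error of measurement: Sm = S * sqrt(1 - r)
    sem = sd_observed * math.sqrt(1 - reliability)

    # 95% confidence interval around the observed score (1.96 standard errors)
    lower = observed_score - 1.96 * sem
    upper = observed_score + 1.96 * sem

    print(f"SEM = {sem:.2f}")                      # 4.00
    print(f"95% CI = [{lower:.2f}, {upper:.2f}]")  # [67.16, 82.84]

Under the model's assumptions, the interval indicates that this examinee's true score plausibly lies between about 67 and 83.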
Traditionally, methods of analysis based on classical test theory have been used to evaluate such tests. The focus of the analysis is on the total test score, the frequency of correct responses (to indicate question difficulty), the frequency of responses (to examine distractors), the reliability of the test, and the item-total correlation (to evaluate discrimination at the item level) (Impara & Plake, 1997). Although these statistics have been widely used, one limitation is that they relate to the sample under scrutiny, and thus all the statistics that describe items and questions are sample dependent (Hambleton, 2000). This critique may not be particularly relevant where successive samples are reasonably representative and do not vary across time, but this needs to be confirmed, and complex strategies have been proposed to overcome this limitation.
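As a rough sketch of how these classical item statistics can be obtained, the fragment below computes item difficulty (proportion correct), corrected item-total correlations, and coefficient alpha (equivalent to KR-20 for right/wrong items) from a small 0/1 response matrix. The matrix is invented for illustration; a real analysis would use the full set of examinees and items.

    import numpy as np

    # Hypothetical responses of 6 examinees to 4 dichotomous items (1 = correct).
    X = np.array([
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 1, 1],
        [0, 1, 0, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 0],
    ])

    # Item difficulty: the proportion of examinees answering each item correctly.
    difficulty = X.mean(axis=0)

    # Corrected item-total correlation as a discrimination index.
    total = X.sum(axis=1)
    discrimination = [np.corrcoef(X[:, j], total - X[:, j])[0, 1] for j in range(X.shape[1])]

    # Coefficient alpha as an internal-consistency reliability estimate.
    k = X.shape[1]
    alpha = k / (k - 1) * (1 - X.var(axis=0, ddof=1).sum() / total.var(ddof=1))

    print("difficulty:", difficulty)
    print("discrimination:", np.round(discrimination, 2))
    print("alpha:", round(alpha, 2))

Because all of these values are computed from one particular sample, they also illustrate the sample-dependence limitation noted above.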

Item Response Theory

Another branch of psychometric theory is item response theory (IRT). IRT may be regarded as roughly synonymous with latent trait theory. It is sometimes referred to as strong true score theory or modern mental test theory because IRT is a more recent body of theory and makes stronger assumptions than classical test theory. This approach to testing, based on item analysis, considers the chance of getting particular items right or wrong. In this approach, each item on a test has its own item characteristic curve that describes the probability of getting that particular item right or wrong given the
ability of the test takers (Kaplan & Saccuzzo, 1997). The Rasch model is appropriate for modeling
dichotomous responses and models the probability of an individual's correct response on a dichotomous
item. The logistic item characteristic curve, a function of ability, forms the boundary between the probability
areas of answering an item incorrectly and answering the item correctly. This one-parameter logistic model
assumes that the discriminations of all items are assumed to be equal to one (Maier, 2001).
Another fundamental feature of this theory is that item performance is related to the estimated amount of the respondent's latent trait (Anastasi & Urbina, 2002). A latent trait, symbolized as theta (θ), refers to a statistical construct. In cognitive tests, the latent trait is called the ability measured by the test, and the total score on a test is taken as an estimate of that ability. A person of a specified ability (θ) has a specified probability of succeeding on an item of specified difficulty.
There are various approaches to the construction of tests using item response theory. Some approaches use two parameters, in which item discriminations and item difficulties are estimated. Other approaches add a third parameter for the probability of test takers with very low levels of ability getting a correct response (as demonstrated in Figure 2). Other approaches use only the difficulty parameter (one parameter), such as the Rasch model. All these approaches characterize the item in relation to the probability that those who do well or poorly on the exam will have different levels of performance.

Two – Parameter Model/Normal – Ogive Model. The ogive model postulates a normal cumulative
distribution function as a response function for an item. The model demonstrates that an item difficulty is a
point on an ability scale where an examinee has a probability of success on the item of .50 (van der Linden
& Hambleton, 2004). In the model, the difficulty of each item can be defined by 50% threshold which is
customary in establishing sensory thresholds in psychophysics. The discriminative power of each item
represented by a curve in the graph is indicated by its steepness. The steeper the curve, the higher the
correlation of item performance with total score and the higher the discriminative index.
The original idea of the model can be traced back to Thurstone's use of the normal model in his discriminal dispersion theory of stimulus perception (Thurstone, 1927). Researchers in psychophysics study the relation between the physical properties of stimuli and their perception by human subjects. Stimulus scaling is presented in more detail later in this special topic. In this procedure, a stimulus is presented to the subject, who reports whether the stimulus is detected. Detection increases as the stimulus intensity increases, and given this pattern, the cumulative normal distribution, with its parametrization, was used as the response function.
Three – Parameter Model/Logistic Model. In plotting an ability (θ) with the probability of
correct response Pi (θ) in a three parameter model, the slope of the curve itself indicates the item
discrimination. The higher the value of the item discrimination, the steeper the slope. In the model,
Birnbaum (1950) proposed a third parameter to account for the nonzero performance of low ability
examinees on multiple choice items. The nonzero performance is due to the probability of guessing correct
answers to multiple choice items (van der Linden & Hambleton, 2004).

Figure 2. Hypothetical Item Characteristic Curves for Three Items.

The item difficulty parameter (b1, b2, b3) corresponds to the location on the ability axis at which the
probability of a correct response is .50. The curves show that item 1 is easier, while items 2 and 3 have the same difficulty at the .50 probability of a correct response. Estimates of item parameters and ability are
typically computed through successive approximations procedures where approximations are repeated until
the values stabilize.
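A minimal sketch of the three-parameter logistic function described above is given below. The item parameters are assumptions chosen only to show how the curve behaves, not values from any calibrated test.

    import math

    # Three-parameter logistic model: theta = ability, a = discrimination (slope),
    # b = difficulty, c = pseudo-guessing (lower asymptote). With c = 0 this is the
    # two-parameter model; with c = 0 and a = 1 it reduces to the one-parameter case.
    def p_correct(theta, a, b, c):
        return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

    # Hypothetical parameters for one multiple-choice item (assumptions only).
    a, b, c = 1.2, 0.5, 0.20

    for theta in (-2, -1, 0, 0.5, 1, 2):
        print(theta, round(p_correct(theta, a, b, c), 2))

    # With a nonzero guessing parameter, the probability at theta = b is
    # c + (1 - c) / 2 (here .60) rather than .50, which is how the lower
    # asymptote accounts for low-ability examinees answering correctly.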

One – Parameter Model/Rasch Model. The Rasch model is based on the assumption that both
guessing and item differences in discrimination are negligible. In constructing tests, the proponents of the
Rasch model frequently discard those items that do not meet these assumptions (Anastasi & Urbina, 2002).
Rasch began his work in educational and psychological measurement in the late 1940s. Early in the 1950s he developed his Poisson models for reading tests and a model for intelligence and achievement tests, which he later called the "structure models for items in a test" and which today is known as the Rasch model.
Rasch's (1960) main motivation for his model was to eliminate references to populations of examinees in analyses of tests. According to him, test analysis would only be worthwhile if it were individual centered, with separate parameters for the items and the examinees (van der Linden & Hambleton, 2004). His work marked IRT with its probabilistic modeling of the interaction between an individual item and an individual examinee. The Rasch model is a probabilistic unidimensional model which asserts that (1) the easier the question, the more likely the student will respond correctly to it, and (2) the more able the student, the more likely he/she will pass the question compared to a less able student.
The Rasch model was derived from the initial Poisson model illustrated in the formula:
ε = δ / θ
where ε is a function of the parameters describing the ability of the examinee and the difficulty of the test, θ represents the ability of the examinee, and δ represents the difficulty of the test, which is estimated from the summation of errors in a test. The model was later refined to assume that the probability that a student will correctly answer a question is a logistic function of the difference between the student's ability [θ] and the difficulty of the question [δ] (i.e., the ability required to answer the question correctly), and only a function of that difference, giving way to the Rasch model.

From this, the expected pattern of responses to questions can be determined given the estimated θ and δ. Even though each response to each question must depend upon the students' ability and the
questions' difficulty, in the data analysis, it is possible to condition out or eliminate the student's abilities (by
taking all students at the same score level) in order to estimate the relative question difficulties (Andrich,
2004; Dobby & Duckworth, 1979). Thus, when data fit the model, the relative difficulties of the questions
are independent of the relative abilities of the students, and vice versa (Rasch, 1977). The further
consequence of this invariance is that it justifies the use of the total score (Wright & Panchapakesan,
1969). In the current analysis this estimation is done through a pair-wise conditional maximum likelihood
algorithm.
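The standard dichotomous form of the Rasch model implied by this discussion can be sketched as follows; the person abilities and item difficulties are invented logit values used purely for illustration.

    import math

    # Rasch model: probability that a person of ability theta answers an item
    # of difficulty delta correctly (both expressed in logits).
    def rasch_p(theta, delta):
        return math.exp(theta - delta) / (1 + math.exp(theta - delta))

    # Hypothetical calibrations (assumptions only).
    abilities = {"Ana": -0.5, "Ben": 0.0, "Cara": 1.2}
    difficulties = {"item1": -1.0, "item2": 0.3, "item3": 1.5}

    for person, theta in abilities.items():
        expected = {item: round(rasch_p(theta, d), 2) for item, d in difficulties.items()}
        print(person, expected)

    # The probability depends only on the difference (theta - delta): when the two
    # are equal the chance of success is .50, easier items are more likely to be
    # answered correctly, and more able persons succeed more often on any item.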
According to Fischer (1974) the Rasch model can be derived from the following assumptions:
(1) Unidimensionality. All items are functionally dependent upon only one underlying continuum.
(2) Monotonicity. All item characteristic functions are strictly monotonic in the latent trait, u. The
item characteristic function describes the probability of a predefined response as a function of the latent
trait.
(3) Local stochastic independence. Every person has a certain probability of giving a predefined
response to each item and this probability is independent of the answers given to the preceding items.
(4) Sufficiency of a simple sum statistic. The number of predefined responses is a sufficient statistic
for the latent parameter.
(5) Dichotomy of the items. For each item there are only two different responses, for example positive and negative. The Rasch model requires that an additive structure underlies the observed data. This additive structure applies to the logit of Pij, where Pij is the probability that subject i will give a predefined response to item j, being the sum of a subject scale value ui and an item scale value vj, i.e., ln(Pij / (1 − Pij)) = ui + vj.
There are various applications of the Rasch Model in test construction through item-mapping
method (Wang, 2003) and as a hierarchical measurement method (Maier, 2001).
Rasch Standard-setting Through Item-mapping. According to Wang (2003), it is logical to justify the use of an item-mapping method for establishing passing scores for multiple-choice licensure and certification examinations. In that study, the researcher wanted to determine a score that marks a passing level of competency, using the Angoff standard-setting method within the Rasch model. The Angoff (1971)
procedure with various modifications is the most widely used for multiple-choice licensure and certification
examinations (Plake, 1998). As part of the Angoff standard-setting process, judges are asked to estimate
the proportion (or percentage) of minimally competent candidates (MCC) who will answer an item correctly.
These item performance estimates are aggregated across items and averaged across judges to yield the
recommended cut score. As noted (Chang, 1999; Impara & Plake, 1997; Kane, 1994), the adequacy of a
judgmental standard-setting method depends on whether the judges adequately conceptualize the minimal
competency of candidates, and whether judges accurately estimate item difficulty based on their
conceptualized minimal competency. A major criticism of the Angoff method is that judges' estimates of
item difficulties for minimal competency are more likely to be inaccurate, and sometimes inconsistent and
contradictory (Bejar, 1983; Goodwin, 1999; Mills & Melican, 1988; National Academy of Education [NAE],
1993; Reid, 1991; Shepard, 1995). Studies found that judges are able to rank order items accurately in
terms of item difficulty, but they are not particularly accurate in estimating item performance for target
examinee groups (Impara & Plake, 1998; National Research Council, 1999; Shepard, 1995). A fundamental
flaw of the Angoff method is that it requires judges to perform the nearly impossible cognitive task of
estimating the probability of MCCs answering each item in the pool correctly (Berk, 1996; NAE, 1993).
An item-mapping method, which applies the Rasch IRT model to the standard setting process, has
been used to remedy the cognitive deficiency in the Angoff method for multiple-choice licensure and
certification examinations (McKinley, Newman, & Wiser, 1996). The Angoff method limits judges to each
individual item while they make an immediate judgment of item performance for MCCs. In contrast, the
item-mapping method presents a global picture of all items and their estimated difficulties in the form of a
histogram chart (item map), which serves to guide and simplify the judges' process of decision making
during the cut score study. The item difficulties are estimated through application of the Rasch IRT model.
Like all IRT scaling methods, the Rasch estimation procedures can place item difficulty and candidate
ability on the same scale. An additional advantage of the Rasch measurement scale is that the difference
between a candidate's ability and an item's difficulty determines the probability of a correct response
(Grosse & Wright, 1986). When candidate ability equals item difficulty, the probability of a correct answer to
the item is .50. Unlike the Angoff method, which requires judges to estimate the probability of an MCC's
success on an item, the item-mapping method provides the probability (i.e., .50) and asks judges to
determine whether an MCC has this probability of answering an item correctly. By utilizing the Rasch
model's distinct relationship between candidate ability and item difficulty, the item-mapping method enables
judges to determine the passing score at the point where the item difficulty equals the MCC's ability level.
The item-mapping method incorporates item performance in the standard-setting process by
graphically presenting item difficulties. In item mapping, all the items for a given examination are ordered in
columns, with each column in the graph representing a different item difficulty. The columns of items are
ordered from easy to hard on a histogram-type graph, with very easy items toward the left end of the graph,
and very hard items toward the right end of the graph. Item difficulties in log odds units are estimated
through application of the Rasch IRT model (Wright & Stone, 1979). In order to present items on a metric
familiar to the judges, logit difficulties are converted to scaled values using the following formula: scaled
difficulty = (logit difficulty × 10) + 100. This scale usually ranges from 70 to 130.
Figure 3. Example of Item Map.

In the example, the abscissa of the graph represents the rescaled item difficulty. Any one column has items
within two points of each other. For example, the column labeled "80" has items with scaled difficulties
ranging from 79 to values less than 81. Using the scaling equation, this column of items would have a
range of logit difficulties from -2.1 to values less than -1.9, yielding a 0.2 logit difficulty range for items in
this column. Similarly, the next column on its right has items with scaled difficulties ranging from 81 to
values less than 83 and a range of logit difficulties from -1.9 to values less than -1.7. In fact, there is a 2-
point range (1 point below the labeled value and 1 point above the labeled value) for all the columns on the
item map. Within each column, items are displayed in order by item ID numbers and can be identified by
color and symbol-coded test content areas. By marking item content areas of the items on the map, a
representative sample of items within each content area can be rated in the standard-setting process. The
goal of item mapping is to locate a column of items on the histogram where judges can reach consensus
that the MCC has a .50 chance of answering the items correctly.
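A small sketch of the rescaling and column-grouping steps described above is shown below. The item identifiers and logit difficulties are invented; a real item map would be built from the full calibrated item pool.

    import math

    # Hypothetical Rasch logit difficulties (invented item IDs and values).
    logit_difficulty = {"A01": -2.05, "A02": -1.90, "B07": -0.30,
                        "C12": 0.10, "C15": 1.45, "D03": 2.60}

    # Conversion described above: scaled difficulty = (logit difficulty x 10) + 100.
    scaled = {item: d * 10 + 100 for item, d in logit_difficulty.items()}

    # Group items into 2-point columns labelled 80, 82, 84, ..., where a column
    # labelled L holds scaled values in [L - 1, L + 1), as in the item map above.
    columns = {}
    for item, s in scaled.items():
        label = 2 * math.floor((s + 1) / 2)
        columns.setdefault(label, []).append(item)

    for label in sorted(columns):
        print(label, sorted(columns[label]))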

Rasch Hierarchical Measurement Method. In a study by Maier (2001), a hierarchical measurement model was developed that enables researchers to measure a latent trait variable and model the error variance corresponding to multiple levels. The Rasch hierarchical measurement model (HMM) results when a Rasch
IRT model and a one-way ANOVA with random effects are combined. Item response theory models and
hierarchical linear models can be combined to model the effect of multilevel covariates on a latent trait.
Through the combination, researchers may wish to examine relationships between person-ability estimates
and person-level and contextual-level characteristics that may affect these ability estimates. Alternatively, it
is also possible to model data obtained from the same individuals across repeated questionnaire
administrations. It is also made possible to study the effect of person characteristics on ability estimates
over time.

Advantages of the IRT

A benefit of item response theory is its treatment of reliability and error of measurement through item information functions, which are computed for each item (Lord, 1980). These functions provide a sound basis for choosing items in test construction. The item information function takes all item parameters into account and shows the measurement efficiency of the item at different ability levels.
Another advantage of the item response theory is the invariance of item parameters which pertains to the
sample-free nature of its results. In the theory the item parameters are invariant when computed in groups
of different abilities. This means that a uniform scale of measurement can be provided for use in different
groups. It also means that groups as well as individuals can be tested with a different set of items,
appropriate to their ability levels and their scores will be directly comparable (Anastasi & Urbina, 2002).
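The role of the item information function can be illustrated with the short sketch below, which uses the common two-parameter information formula a²P(1 − P); with a = 1 it reduces to the Rasch-model information P(1 − P). The item parameters are invented for illustration.

    import math

    # Two-parameter logistic probability of a correct response.
    def p_2pl(theta, a, b):
        return 1 / (1 + math.exp(-a * (theta - b)))

    # Item information for the 2PL model: a^2 * P * (1 - P).
    def item_information(theta, a, b):
        p = p_2pl(theta, a, b)
        return a ** 2 * p * (1 - p)

    # Two hypothetical items (parameter values are assumptions only).
    items = {"easy, flat": (0.8, -1.0), "hard, steep": (1.8, 1.0)}

    for theta in (-2, -1, 0, 1, 2):
        row = {name: round(item_information(theta, a, b), 3) for name, (a, b) in items.items()}
        print(theta, row)

    # Information peaks where theta is near the item difficulty b, so items can be
    # selected to target the ability range where precision is most needed.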

Scaling Models

"Measurement essentially is concerned with the methods used to provide quantitative descriptions of the extent to which individuals manifest or possess specified characteristics" (Ghiselli, Campbell, & Zedeck, 1981, p. 2). "Measurement is the assigning of numbers to individuals in a systematic way as a means of representing properties of the individuals" (Allen & Yen, 1979, p. 2). "'Measurement' consists of rules for assigning symbols to objects so as to (1) represent quantities of attributes numerically (scaling) or (2) define whether the objects fall in the same or different categories with respect to a given attribute (classification)" (Nunnally & Bernstein, 1994, p. 3).
There are important aspects to consider in the process of measurement in psychometrics. First, the attribute of interest needs to be quantified; that is, numbers are used to designate how much (or how little) of an attribute an individual possesses. Second, the attribute of interest must be quantified in a consistent and systematic way (i.e., standardization); that is, the measurement process is systematic enough that meaningful replication is possible. Finally, it is the attributes of individuals (or objects) that are measured, not the individuals per se.

Levels of Measurement

As the definition of Nunnally and Bernstein (1994) suggests, by systematically measuring the attribute of interest, individuals can either be classified or scaled with regard to that attribute. Whether one engages in classification or scaling depends in large part on the level of measurement used to assess the construct. For example, if the attribute is measured on a nominal scale of measurement, then it is only possible to classify individuals as falling into one or another mutually exclusive category (Agresti & Finlay, 2000). This is because the different categories (e.g., men versus women) represent only qualitative differences. Nominal scales are used as measures of identity (Downie & Heath, 1984). When gender is coded, for example with males coded 0 and females coded 1, the values do not have any quantitative meaning; they are simply labels for gender categories. At the nominal level of measurement, there are a variety of sorting techniques in which subjects are asked to sort stimuli into different categories based on some dimension.
Some data reflect the rank order of individuals or objects, such as a scale evaluating the beauty of a person from highest to lowest (Downie & Heath, 1984). This represents an ordinal scale of measurement, where objects are simply rank ordered. An ordinal scale does not indicate how much more of the attribute one object has than another, but it can be determined that A has more than B if A is ranked higher than B. At the ordinal level of measurement, the Q-sort method, paired comparisons, Guttman's scalogram, Coombs's unfolding technique, and a variety of rating scales can be used. The subject's major task is to rank order items from highest to lowest or from weakest to strongest.
The interval scale of measurement has equal intervals between degrees on the scale. However, the zero point on the scale is arbitrary; 0 degrees Celsius represents the point at which water freezes at sea level. That is, zero on the scale does not represent "true zero," which in this case would mean a complete absence of heat. In determining the area of a table, a ratio scale of measurement is used because zero does represent "true zero."
When the construct of interest is measured at the nominal (i.e., qualitative) level of measurement, objects are only classified into categories. As a result, the types of data manipulations and statistical analyses that can be performed on the data are very limited. In terms of descriptive statistics, it is possible to compute frequency counts or determine the modal response (i.e., category), but not much else. However, if it is at least possible to rank order the objects based on the degree to which they possess the construct of interest, then it is possible to scale the construct. In addition, higher levels of measurement allow for more in-depth statistical analyses. With ordinal data, for example, statistics such as the median, range, and interquartile range can be computed (Downie & Heath, 1984). When the data are interval level, it is possible to calculate statistics such as means, standard deviations, variances, and the various statistics of shape (e.g., skewness and kurtosis). With interval-level data, it is important to know the shape of the distribution, as different-shaped distributions imply different interpretations for statistics such as the mean and standard deviation.
At the interval and ratio level of measurement, there is direct estimation, the method of bisection,
and Thurstone’s methods of comparative and categorical judgments. With these methods, subjects are
asked not only to rank order items but also to actually help determine the magnitude of the differences
among items. With Thurstone’s method of comparative judgment, subjects compare every possible pair of
stimuli and select the item within the pair that is the better item for assessing the construct. Thurstone’s
method of categorical judgment, while less tedious for subjects when there are many stimuli to assess in
that they simply rate each stimulus (not each pair of stimuli), does require more cognitive energy for each
rating provided. This is because the subject matter expert (SME) must now estimate the actual value of the stimulus.
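As an illustration of how comparative judgments can be turned into scale values, the sketch below applies the simplest (Case V) version of Thurstone's method of comparative judgment to an invented matrix of paired-comparison proportions; it assumes the SciPy library is available.

    import numpy as np
    from scipy.stats import norm

    # Hypothetical paired-comparison data for three stimuli A, B, C: entry [i, j]
    # is the proportion of judges who preferred stimulus j over stimulus i
    # (values are invented; the diagonal is conventionally .50).
    P = np.array([
        [0.50, 0.70, 0.90],
        [0.30, 0.50, 0.75],
        [0.10, 0.25, 0.50],
    ])

    # Thurstone Case V: convert proportions to z-scores and average each column.
    Z = norm.ppf(P)
    scale_values = Z.mean(axis=0)

    # Anchor the lowest-scaled stimulus at zero for readability.
    scale_values = scale_values - scale_values.min()

    for name, value in zip("ABC", scale_values):
        print(name, round(value, 2))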

Unidimensional Scaling Models

Psychological measurement is typically most interested in scaling some characteristic, trait, or ability of a person. It determines how much of an attribute of interest a given person possesses. This allows one to estimate the degree of inter-individual and intra-individual differences among the subjects on the attribute of interest. There are various ways of scaling, such as scaling the stimuli given to individuals as well as the responses that individuals provide.

Scaling for a Stimuli (Psychophysics)

Scaling of stimuli is more prominent in the area of psychophysics, or sensory/perception psychology, which focuses on physical phenomena and whose roots date back to mid-19th century Germany. It was not until the 1920s that Thurstone began to apply the same scaling principles to scaling psychological attitudes. In the process of psychophysical scaling, one factor is held constant (e.g., responses), a second factor is collapsed across (e.g., stimuli), and the third factor (e.g., individuals) is then scaled. With psychological scaling, however, it is typical to ask participants to provide their professional judgment of the particular stimuli, regardless of their personal feelings or attitudes toward the topic or stimulus. This may include ratings of how well different stimuli represent the construct and at what level of intensity the construct is represented. In scaling for stimuli, research issues frequently concern the exact nature of the functional relations between scalings of the stimuli in different circumstances (Nunnally, 1970).
There are a variety of ways of scaling stimuli through psychophysical methods. Psychophysical methods examine the relationship between the placement of objects on the two scales and attempt to establish principles or laws that connect the two (Roberts, 1999). The following psychophysical methods include, among others, rank order, constant stimuli, and successive categories.
(1) Method of Adjustment - An experimental paradigm which allows the subject to make small
adjustments to a comparison stimulus until it matches a standard stimulus. The intensity of the stimulus is
adjusted until target is just detectable.
(2) Method of Limits – The intensity is adjusted in discrete steps until the observer reports that the stimulus is just detectable.
(3) Method of Constant Stimuli – The experimenter has control of the stimuli. Several stimulus values are chosen to bracket the assumed threshold, and each stimulus is presented many times in random order. The psychometric function is derived from the proportion of detected responses (a short computational sketch of this method appears after the list).
(4) Staircase Method – To determine a threshold as quickly as possible. Compromise between the
method of limits and method of constant stimuli.
(5) Method of Forced Choice (2AFC) – Observer must choose between two or more options. Good
for cases where observers are less willing to guess.
(6) Method of Average Error – The subject is presented with a standard stimulus and then, over repeated trials, attempts to match or reproduce it; the average of the errors across trials indicates sensitivity.
(7) Rank order – requires the subject to rank stimuli from most to least with respect to some
attribute of judgment or sentiment.
(8) Paired comparison – a subject is required to rank a stimuli two at a time in all possible pairs.
(9) Successive categories – the subject is asked to sort a collection of stimuli into a number of
distinct piles or categories, which are ordered with respect to a specified attribute.
(10) Ratio judgment – The experimenter selects a standard stimulus and a number of variable
stimuli that differ quantitatively from the standard stimulus on a given characteristic. The subject selects
from the range of variable stimuli, the stimulus whose amount of the given characteristic corresponds to the
ratio value.
(11) Q sort – subjects are required to sort the stimuli into an approximate normal distribution, with
its being specified how many stimuli are to be placed in each category.
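The sketch referred to under the method of constant stimuli is given here. It shows one simple way to estimate a detection threshold from constant-stimuli data; the intensities and response counts are invented.

    import numpy as np

    # Hypothetical data: fixed intensities, each presented 40 times in random order.
    intensity  = np.array([1, 2, 3, 4, 5, 6])      # arbitrary physical units
    presented  = np.array([40, 40, 40, 40, 40, 40])
    detected   = np.array([2, 8, 18, 30, 37, 39])

    # Empirical psychometric function: proportion of "detected" responses.
    proportion = detected / presented

    # Simple estimate of the absolute threshold: the intensity at which the
    # proportion of detections crosses .50 (linear interpolation).
    threshold = np.interp(0.5, proportion, intensity)

    for i, p in zip(intensity, proportion):
        print(i, round(p, 2))
    print("estimated 50% threshold:", round(threshold, 2))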

Scaling for People (Psychometrics)

Many issues arise when performing a scaling study. One important factor is who is selected to participate in the study. Many scaling problems involve some psychological (latent) dimension of people without any direct counterpart "physical" dimension. When people are scaled (psychometrics), it is typical to obtain a random sample of individuals from the population to which one wishes to generalize. With psychometrics, participants are asked to provide their individual feelings, attitudes, and/or personal ratings
toward a particular topic. In doing so, one is able to determine how individuals differ on the construct of
interest. With stimulus scaling, however, the researcher would sum across raters within a given stimulus
(e.g., question) in order to obtain rating(s) of each stimulus. Once the researcher is confident that each
stimulus did, in fact, tap into the construct and had some estimate of the level at which it did so, only then
should the researcher feel confident in presenting the now scaled stimuli to a random sample of relevant
participants for psychometric purposes. Thus, with psychometrics, items (i.e., stimuli) are summed across
within an individual respondent in order to obtain his or her score on the construct.
The major requirement in scaling for people is that variables should be monotonically related to
each other. A relationship is monotonic if higher scores in one scale correspond to higher scores on
another scale, regardless of the shape of the curve (Nunnally, 1970). In scaling for people, many items on a test are used to minimize measurement error. The specificity of items is averaged out when they are combined, and by combining items one can make relatively fine distinctions between people. The problem of scaling people with respect to attributes is then one of collapsing responses to a number of items so as to obtain one score for each person.
One variety of scaling for people is the deterministic model, which assumes that there is no error in the item trace lines. A trace line shows that a person with a high level of ability would have a probability close to 1.0 of giving the correct response. The model assumes that up to a certain point on the attribute the probability of response alpha is zero, and beyond that point the probability of response alpha is 1.0. Each item has a biserial correlation of 1.0 with the attribute, and consequently each item perfectly discriminates at a particular point of the attribute.
There are a variety of scaling models for people, including the Thurstone scale, the Likert scale, the Guttman scale, and semantic differential scaling.
(1) Thurstone scaling – Around 300 judges rate 100 statements on a particular issue on an 11-point scale. A subset of statements is then shown to respondents, and their score is the mean of the ratings for the statements they select.
(2) Likert scale – Respondents are requested to state their level of agreement with a series of attitude statements. Each scale point is given a value (say, 1-5) and the person is given the score corresponding to their degree of agreement. Often a set of Likert items is summed to provide a total score for the attitude (a short scoring sketch appears after this list).
(3) Guttman scale – It involves producing a set of statements that form a natural hierarchy. A positive answer to an item at one point in the hierarchy assumes positive answers to all the statements below it (e.g., a disability scale). This gets over the problem of item totals being formed by different sets of responses.
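The Likert scoring sketch referred to above is given here. The responses are invented, and the reverse-scoring of negatively worded items is a common practice assumed for the example rather than taken from the text.

    # Hypothetical 5-point Likert responses (1 = strongly disagree ... 5 = strongly agree)
    # for one respondent on a six-item attitude scale; items 3 and 6 are assumed to be
    # negatively worded and are reverse-scored before summing.
    responses = {1: 4, 2: 5, 3: 2, 4: 4, 5: 3, 6: 1}
    reverse_keyed = {3, 6}

    def score_item(item, value, points=5):
        # Reverse-score negatively worded items; leave the rest unchanged.
        return (points + 1) - value if item in reverse_keyed else value

    scored = {item: score_item(item, value) for item, value in responses.items()}
    total = sum(scored.values())

    print(scored)                          # {1: 4, 2: 5, 3: 4, 4: 4, 5: 3, 6: 5}
    print("total attitude score:", total)  # 25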

Scaling Responses

The third category, responses, which is typically held constant, also needs to be identified. That is, a decision is made about the fashion in which subjects will respond to the stimuli. Such response options may include requiring participants to make comparative judgments (e.g., which is more important, A or B?), subjective evaluations (e.g., strongly agree to strongly disagree), or absolute judgments (e.g., how hot is this object?). Different response formats may well influence how stimuli are written and edited. In addition, they may also influence how one evaluates the quality or the "accuracy" of the response. For example, with absolute judgments, standards of comparison are used, especially if subjects are being asked to rate physical characteristics such as weight, height, or intensity of sound or light. With attitudes and psychological constructs, such "standards" are hard to come by. There are a few options (e.g., Guttman's scalogram and Coombs's unfolding technique) for simultaneously scaling people and stimuli, but more often than not only one dimension is scaled at a time. However, stimuli are usually scaled first (or a well-established measure is sought) before one can have confidence in scaling individuals on those stimuli.

Multidimensional Scaling Models

With unidimensional scaling, as described previously, subjects are asked to respond to stimuli with
regard to a particular dimension. With multidimensional scaling (MDS), however, subjects are typically
asked to give just their general impression or broad rating of similarities or differences among stimuli.
Subsequent analyses, using Euclidean spatial models, would “map” the products in multidimensional
space. The different multiple dimensions would then be “discovered” or “extracted” with multivariate
statistical techniques, thus establishing which dimensions the consumer is using to distinguish the
products. MDS can be particularly useful when subjects are unable to articulate “why” they like a stimulus,
yet they are confident that they prefer one stimulus to another.
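A minimal computational sketch of this idea is shown below; the dissimilarity matrix is invented, and the example assumes the scikit-learn library is available for the MDS solution.

    import numpy as np
    from sklearn.manifold import MDS

    # Hypothetical dissimilarity ratings among four products (0 = identical);
    # the matrix is symmetric with a zero diagonal and the values are invented.
    labels = ["A", "B", "C", "D"]
    dissimilarity = np.array([
        [0.0, 2.0, 6.0, 7.0],
        [2.0, 0.0, 5.0, 6.5],
        [6.0, 5.0, 0.0, 1.5],
        [7.0, 6.5, 1.5, 0.0],
    ])

    # Metric MDS: place the four products in a two-dimensional Euclidean space
    # whose inter-point distances approximate the judged dissimilarities.
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coordinates = mds.fit_transform(dissimilarity)

    for label, (x, y) in zip(labels, coordinates):
        print(label, round(x, 2), round(y, 2))

    # The recovered axes are then inspected (or rotated) to interpret which
    # dimensions the respondents used to distinguish the stimuli.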
References

Agresti, A. & Finlay, B. (1997). Statistical methods for the social sciences (3rd ed.). New Jersey: Prentice
Hall.

Anastasi, A. & Urbina, S. (2002). Psychological testing. Prentice Hall: New York.

Andrich, D. (1998). Rasch models for measurement. Sage University: Sage Publications.

Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational
measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education.

Bejar, I. I. (1983). Subject matter experts' assessment of item statistics. Applied Psychological
Measurement, 7, 303-310.

Berk, R. A. (1996). Standard setting: the next generation. Applied Measurement in Education, 9, 215-235.

Chang, L. (1999). Judgmental item analysis of the Nedelsky and Angoff standard-setting methods. Applied
Measurement in Education, 12, 151-166.

Crocker, L. M., & Algina, J. (1986). Introduction to classical and modern test theory. Belmont, CA:
Wadsworth.

Dobby, J., & Duckworth, D. (1979). Objective assessment by means of item banking. Schools Council Examination Bulletin, 40, 1-10.

Downie, N.M., & Heath, R.W. (1984). Basic statistical methods (5th ed.). New York: Harper & Row
Publishers.

Fischer, G. H. (1974) Derivations of the Rasch Model. In Fischer, G. H. & Molenaar, I. W. (Eds) Rasch
Models: foundations, recent developments and applications, pp. 15-38 New York: Springer Verlag.

Ghiselli, E. E., Campbell, J. P., & Zedeck, S. (1981). Measurement theory for the behavioral sciences. New
York: W. H. Freeman.

Guilford, J. P. (1954). Psychometric methods. New York: McGraw-Hill.

Goodwin, L. D. (1999). Relations between observed item difficulty levels and Angoff minimum passing
levels for a group of borderline candidates. Applied Measurement in Education, 12, 13-28.

Grosse, M. E., & Wright, B. D. (1986). Setting, evaluating, and maintaining certification standards with the
Rasch model. Evaluation and the Health Professions, 9, 267-285.

Hambleton, R. K. (2000). Emergence of item response modeling in instrument development and data analysis. Medical Care, 38, 60-65.

Impara, J. C., & Plake, B. S. (1997). Standard setting: An alternative approach. Journal of Educational
Measurement, 34, 353-366.

Impara, J. C., & Plake, B. S. (1998). Teachers' ability to estimate item difficulty: A test of the assumptions in
the Angoff standard setting method. Journal of Educational Measurement, 35, 69-81.

Kane, M. (1994). Validating the performance standards associated with passing scores. Review of
Educational Research, 64, 425-461.

Kaplan, R. M., & Saccuzzo, D. P. (1997). Psychological testing: Principles, applications and issues. Pacific Grove, CA: Brooks/Cole Publishing Company.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ:
Erlbaum.

Magno, C. (2003). Relationship between attitude towards technical education and academic achievement in mathematics and science of the first and second year high school students, Caritas Don Bosco School, SY 2002-2003. Unpublished master's thesis, Ateneo de Manila University, Quezon City.

Maier, K. S. (2001). A Rasch hierarchical measurement model. Journal of Educational and Behavioral
Statistics, 26, 307-331.

McKinley, D. W., Newman, L. S., & Wiser, R. F. (1996, April). Using the Rasch model in the standard-setting process. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY.

Mills, C. N., & Melican, G. J. (1988). Estimating and adjusting cutoff scores: Future of selected methods.
Applied Measurement in Education, 1, 261-275.

National Academy of Education (1993). Setting performance standards for student achievement. Stanford,
CA: Author.

National Research Council (1999). Setting reasonable and useful performance standards. In J. W.
Pellegrino, L. R. Jones, & K. J. Mitchell (Eds.), Grading the nation's report card: Evaluating NAEP and
transforming the assessment of educational progress (pp. 162-184). Washington, DC: National Academy
Press.

Novick, M. R. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, 3, 1-18.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.) New York: McGraw-Hill.

Plake, B. S. (1998). Setting performance standards for professional licensure and certification. Applied
Measurement in Education, 11, 65-80.

Reid, J. B. (1991). Training judges to generate standard-setting data. Educational Measurement: Issues
and Practice, 10, 11-14.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark:
Danish Institute for Educational Research.

Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of
scientific statements. In G. M. Copenhagen (ed.). The Danish yearbook of philosophy (pp.58-94).
Munksgaard.

Shepard, L. A. (1995). Implications for standard setting of the national academy of education evaluation of
the national assessment of educational progress achievement levels. Proceedings of Joint Conference on
Standard Setting for Large-Scale Assessments (pp. 143-160). Washington, DC: The National Assessment
Governing Board (NAGB) and the National Center for Education Statistics (NCES).

Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72-101.

Torgerson, W. S. (1958). Theory and methods of scaling. New York: Wiley.

Thurstone, L. L. (1927). The unit of measurements in educational scales. Journal of Educational Psychology, 18, 505-524.

Wang, N. (2003). Use of the Rasch IRT model in standard setting: An item-mapping method. Journal of Educational Measurement, 40, 231.

Van der Linden, W. J., & Hambleton, R. K. (2004). Item response theory: Brief history, common models, and extension. New York: McGraw-Hill.

van der Ven, A. H. G. S. (1980). Introduction to scaling. New York: Wiley.

Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago: MESA Press.

Wright, B. D., & Panchapakesan, N. (1969). A procedure for sample free item analysis. Educational and Psychological Measurement, 29, 23-48.

Exercise:

Calibrate the item difficulty and person ability of the scores in a Reading Comprehension test with 19 items among 15 Korean students. After performing the Rasch model analysis, determine item difficulty using the classical test theory approach. Compare the results.
Case   Item: 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
A            0  1  1  1  0  1  1  0  0  1  1  1  0  0  0  0  1  0  1
B            0  0  1  1  0  0  0  0  1  1  1  1  1  1  0  0  0  0  0
C            0  0  0  1  0  1  0  0  0  0  1  1  0  1  1  0  1  0  1
D            0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  1  1
E            0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
F            0  0  0  1  0  0  1  0  1  1  0  0  0  1  0  0  1  1  0
G            1  0  0  1  0  1  0  0  0  0  1  1  0  0  0  0  0  0  1
H            0  0  1  1  0  1  0  0  0  0  0  1  1  0  0  0  0  1  1
I            0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
J            0  0  1  1  0  0  0  0  1  0  1  0  0  1  1  0  1  1  1
K            1  0  0  1  0  0  0  0  0  1  0  1  1  1  0  1  0  1  0
L            0  0  0  0  0  0  0  0  1  0  0  1  0  0  0  0  1  0  1
M            0  0  0  0  0  0  0  0  0  1  1  0  0  1  1  0  1  0  1
N            0  0  1  1  1  1  1  1  1  0  0  1  0  1  0  0  1  0  1
O            0  0  0  0  0  1  0  0  1  0  0  1  0  0  0  0  1  0  1
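If you wish to check your hand computations, the sketch below shows one way to obtain the classical test theory item difficulties (proportion correct) and rough Rasch-style logit values for the data above. It uses the simple log-odds (PROX-type) approximation rather than a full joint maximum likelihood calibration, so treat the logits as starting estimates only.

```python
# Sketch: CTT item difficulty and rough Rasch-style logits for the data above.
# Rows = 15 students (A-O), columns = 19 items; 1 = correct, 0 = incorrect.
import math

data = [
    [0,1,1,1,0,1,1,0,0,1,1,1,0,0,0,0,1,0,1],  # A
    [0,0,1,1,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0],  # B
    [0,0,0,1,0,1,0,0,0,0,1,1,0,1,1,0,1,0,1],  # C
    [0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1],  # D
    [0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1],  # E
    [0,0,0,1,0,0,1,0,1,1,0,0,0,1,0,0,1,1,0],  # F
    [1,0,0,1,0,1,0,0,0,0,1,1,0,0,0,0,0,0,1],  # G
    [0,0,1,1,0,1,0,0,0,0,0,1,1,0,0,0,0,1,1],  # H
    [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],  # I (zero score: not estimable)
    [0,0,1,1,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1],  # J
    [1,0,0,1,0,0,0,0,0,1,0,1,1,1,0,1,0,1,0],  # K
    [0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,1],  # L
    [0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,1,0,1],  # M
    [0,0,1,1,1,1,1,1,1,0,0,1,0,1,0,0,1,0,1],  # N
    [0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,1],  # O
]
n_persons, n_items = len(data), len(data[0])

def logit(p):
    return math.log(p / (1 - p))

# Classical test theory: difficulty = proportion of examinees answering correctly.
for j in range(n_items):
    p = sum(row[j] for row in data) / n_persons
    # Rasch-style starting value: log-odds of an incorrect response, so that
    # harder items (small p) get larger positive logits.
    d = logit(1 - p) if 0 < p < 1 else float("nan")
    print(f"Item {j + 1:2d}: CTT p = {p:.2f}, logit difficulty = {d:.2f}")

# Person ability: rough logit of each student's proportion correct.
# Zero or perfect raw scores cannot be placed on the logit scale (student I).
for name, row in zip("ABCDEFGHIJKLMNO", data):
    score = sum(row)
    prop = score / n_items
    theta = logit(prop) if 0 < prop < 1 else float("nan")
    print(f"Student {name}: raw = {score:2d}/{n_items}, logit ability = {theta:.2f}")
```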

References

Anastasi, A. & Urbina, S. (2002). Psychological testing (7th ed.). NJ: Prentice Hall.

DiLeonardi, J. W. & Curtis, P. A. (1992). What to do when the numbers are in: A users guide to
statistical data analysis in the human services. Chicago IL, Nelson-Hall Inc.

Kaplan, R. M., & Saccuzzo, D. P. (1997). Psychological testing: Principles, applications, and issues (4th ed.). Pacific Grove, CA: Brooks/Cole Publishing Company.

Magno, C. (2007). Exploratory and confirmatory factor analysis of parental closeness and
multidimensional scaling with other parenting models. The Guidance Journal, 36, 63-89.

Magno, C., Lynn, J., Lee, K., & Kho, R. (in press). Parents' school-related behavior: Getting involved with a grade school and college child. The Guidance Journal.

Magno, C., Tangco, N., & Tan, C. (2007). The role of metacognitive skills in developing critical thinking. Paper presented at the Asian Association of Social Psychology conference, Universiti Malaysia Sabah, Kota Kinabalu, Sabah, Malaysia, July 25-28.

Payne, D. A. (1992). Measuring and evaluating educational outcomes. New York: Macmillan Publishing Company.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen,
Denmark: Danish Institute for Educational Research.

Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality
and validity of scientific statements. In G. M. Copenhagen (ed.). The Danish yearbook of
philosophy (pp.58-94). Munksgaard.

Van der Linden, W. J., & Hambleton, R. K. (2004). Item response theory: Brief history, common models, and extension. New York: McGraw-Hill.

Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago: MESA
Press.

Chapter 4
Developing Teacher-made Tests

Objectives

1. Explain the theories and concepts that rationalize the practice of assessment.
2. Make a table of specification of the test items.
3. Design pen-and-paper tests that are aligned to the learning intents.
4. Justify the advantages and disadvantages of any pen-and-paper test.
5. Evaluate the test items according to the guidelines.

Lessons

1 The Test Blueprint


2 Designing Selected-Response Items
Binary-choice items
Multiple-choice items
Matching items
3 Designing Constructed-Response Items
Short-answer items
Essay items

Lesson 1
The Test Blueprint

As we have mentioned in the previous chapters, teaching involves decision-making. This chapter discusses another aspect of teaching that requires intelligent and informed decisions from teachers. In this chapter, we wish to provide teachers with the basic scaffolds for developing pen-and-paper tests, in the hope of meeting the objectives listed at the start of this chapter.
As the term suggests, teacher-made tests are assessment tools, particularly pen-and-paper types, that teachers develop, use, and score based on the learning targets set for the task or domain to be tested. A teacher-made test draws on content, knowledge, and process domains. The content
domain is the subject area from which items are drawn. In its general sense, it is the subject or
course (i.e., science, math, English, etc.) in which testing is to be made. Specifically, it covers
specific topics under a subject area (i.e., the laws of motion, addition of fractions, or use of a
singular verb in a sentence). The knowledge domain involves those dimensions or types of
knowledge to be tested. In the revised taxonomy, this domain involves those knowledge
dimensions as factual, conceptual, procedural, and metacognitive knowledge types. As for the
process domain, any pen-and-paper test involves the aspects of mental processes that students
use to engage the task in the test. In the revised taxonomy, those mental procedures as
remembering, understanding, applying, and so on, are the processes that may be tested.

A. Call to mind those alternative taxonomic tools in Chapter 2.


B. Identify the knowledge domain and the process domain of each alternative
taxonomy.
C. Monitor your understanding by clearly accounting for what you already know
about these domains, or by figuring out those areas that you do not understand
yet.
D. Formulate questions regarding what you wish to clarify about the matter that you
do not clearly understand.
E. When appropriate, raise your questions in class or discuss with your classmates.

To make sure that you have these domains accounted for in your assessment design, engage yourself in making a table of specification, one that will allow you to explicitly indicate what content to cover in your test, what knowledge dimensions to focus on, and what cognitive processes to pay attention to.
The table of specification is a matrix where the rows consist of the specific topics or skills (content or skill areas) and the columns are the learning behaviors or competencies that we desire to measure. Although we can also add more elements to the matrix, such as Test Placement, Equivalent Points, or Percent values of items, the conventional prototype table of specification may look like this:

                                                  Cognitive Processes
Content (or Skill) Areas                    Knowledge   Application   Analysis   TOTAL
1. Translation from words to
   mathematical symbols                         1            2            2        5
2. Forming the Linear Equation                  1            3            2        6
3. Solving the Linear Equation                               3            1        4
4. Checking the Answer                                       3            2        5
TOTAL                                           2           11            7       20

Note: In the original layout, callouts point from the Knowledge column total ("the number of items measuring Knowledge"), from the Application cell for solving the linear equation ("number of items for solving linear equation, measuring Application"), and from the grand total ("the total number of test items").
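One practical way to keep the blueprint honest is to tally it before writing a single item. The short sketch below stores the sample table above as a simple data structure and verifies that the row and column totals agree with the planned test length; the content labels simply mirror the example table.

```python
# Sketch: the sample table of specification above as a data structure,
# with a check that row and column totals agree with the planned 20 items.
table_of_specification = {
    "Translation from words to mathematical symbols": {"Knowledge": 1, "Application": 2, "Analysis": 2},
    "Forming the Linear Equation":                    {"Knowledge": 1, "Application": 3, "Analysis": 2},
    "Solving the Linear Equation":                    {"Application": 3, "Analysis": 1},
    "Checking the Answer":                            {"Application": 3, "Analysis": 2},
}
planned_length = 20

row_totals = {area: sum(cells.values()) for area, cells in table_of_specification.items()}
column_totals = {}
for cells in table_of_specification.values():
    for process, n in cells.items():
        column_totals[process] = column_totals.get(process, 0) + n

print("Items per content area:", row_totals)
print("Items per cognitive process:", column_totals)
assert sum(row_totals.values()) == sum(column_totals.values()) == planned_length
```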

As you have seen in the above table of specification, only three cognitive processes are indicated. This means that if you use the old Bloom's taxonomy of behavioral objectives, you include only those levels that you wish to measure in the test, although it is recommended that more than a single process be measured in a test, depending, of course, on your purpose of testing.
As a test blueprint, the table of specification ensures that the teacher sees all the essential details of testing and measuring student learning. It assures the teacher that the content areas (or skill areas) and the levels of behavior in which learning is expected to be anchored are measured. The test's degree of difficulty may also be seen in the table of specification. When the distribution of test items is concentrated in the higher-order cognitive behaviors (analysis, synthesis, evaluation), the test's difficulty level is higher than when the items are concentrated in the lower-order cognitive behaviors (knowledge, comprehension, application).
As you have learned in Chapter 2 of this book, there are many taxonomic tools that may be used in our instructional planning. The taxonomic tool for planning the test should be consistent with the taxonomy of learning objectives used in the overall instructional plan. Understandably, designing the table of specification using any taxonomic tool will require some of our time, effort, and other personal and motivational resources. Before we are tempted to develop pen-and-paper test items without first preparing our table of specification, and so run the risk of not actually evaluating our students on the basis of our learning intents, we need to brush up on our understanding of the instrumental function of the table of specification as a blueprint for our test, and convince ourselves that this is an important step in any test development activity. In developing the table of specification, we suggest that you do not yet think of the types of pen-and-paper test you wish to give. Instead, focus on planning your test in terms of your assessment domain.

Lesson 2
Designing Selected-Response Items

When you are done with your test blueprint, you are now ready to start developing your
test items. For this phase of test development, you will need to decide what types or methods of
pen-and-paper assessment you wish to design. To aid you in this process, we will now discuss
some of the common pen-and-paper types of test and the basic guidelines in the formulation of
items for each type.
In deciding about the assessment method to use for a pen-and-paper test, you choose whether the selected-response or the constructed-response types would be appropriate for your blueprint. Selected-response tests use those types of items that require the test takers to respond by choosing an option from a list of alternatives. Common types of selected-response tests are binary-choice, multiple-choice, and matching tests.

Binary-choice Items
The binary-choice test offers students the opportunity to choose between two options for
an answer. The items must be responded to by choosing one of two categorically distinct
alternatives. The true-false test is an example of this type of selected-response test. It typically contains short statements that represent less complex propositions, and is therefore efficient in assessing certain levels of students' learning in a reasonably short testing period. In addition, a binary-choice test may cover a wider content area in a brief assessment session (Popham, 2005).
To assist you in developing binary-choice items, here are some guidelines with a brief description of each. These guidelines may not capture everything that you need to be mindful of in developing teacher-made tests; they are just the basics of what you need to know. It is important that you also explore other aspects of test development, including the context in which the test is to be used, among others.

Make the instructions explicit


A basic requirement of any pen-and-paper test is that the instructions indicate the task that students need to do and the credit they can earn for every correct answer. However, there is one more thing you need to indicate in your instructions for a binary-choice test: the reference of validity or reference of truth. When you ask your students to judge whether a statement is true or false, correct or erroneous, or valid or invalid, you need to state the reference for the truth or correctness of a response. If the reference is a reading article, textbook, teacher's lecture, class discussion, or resource person, state it in your instructions. This will help students think contextually and stay on track. It will also help you cluster your items according to a specific domain or context. Also, it can minimize the problem of conflicting information, such as when one resource material says one thing and one person (maybe your student's parent or another teacher) says otherwise. For items that vary in context and reference of truth, state the reference in the item itself. For example, if the item is drawn from a person's opinion, such as the principal's speech or a guest speaker's ideas, it is important that you attribute the opinion to its source. Lastly, although not a must, it might be nicer to use "please" and "thank you" in our test instructions.

State the item as either definitely true or false


Statements must be categorically true or false, and not conditionally so. Each item should clearly communicate whether the idea or concept is true, correct, and valid or false, erroneous, and invalid. Make sure that it clearly corresponds to the reference of validity and that the context is explicit. For the quality to be categorical, it must invite only a judgment of contradictories, not contraries. For example, white or not white implies a contradiction because one idea is a denial of the other. To say black or white indicates opposing ideas that imply values between them, such as gray. A good item implies only contradictory, mutually exclusive qualities, that is, either true or false, and does not need further qualification to make it true or false.

Keep the statements short, simple, but comprehensible


In formulating binary-choice items, it is wise to consider brevity in the statement. Good binary-choice items are concisely written so that they present the ideas clearly and avoid extraneous material. Making the statements too long is risky in that it might unintentionally provide clues that make the statement obviously true or false. There is actually no clear-cut rule for brevity; it is usually left to the teacher's judgment. In preparing the whole binary-choice test, it is also important that all the items or statements maintain relatively the same length. For a statement to be comprehensible, it must make clear sense of the ideas or concepts in focus, which is usually lost when a teacher lifts a statement from a book and uses it as a test item.

Do away with tricks


We remember that the purpose of assessing our students' learning is based on the assessment objectives we set. Clearly, decoding tricks is remote from, if not totally excluded by, those intents. Therefore, we need to avoid using tricks, such as double negatives in the statement or switched keys. The use of double-negative statements is a logical trickery because the "valence" of the statement is maintained while its wording obscures it. These statements are usually puzzling and will therefore take more time for students to understand. Switching keys is when you ask students to answer "false" if the statement is true, or "true" if the statement is false. This is obviously an unjustifiable trick. By all means, we have to avoid using any kind of trick not only in binary-choice tests but also in all other types and methods of assessment.

Get rid of those clues


Clues come in different forms. One of the common clues that can weaken the validity and reliability of our assessment comes from our use of certain words, such as those that denote universal quantity or definite degree (i.e., all, everyone, always, none, nobody, never, etc.). Statements with these words are usually false because it is almost always wrong to say that one instance applies to all sorts of things. Other verbal clues may come from terms that denote indefinite degree (i.e., some, few, long time, many years, regularly, frequently, etc.). These words do not indicate a definite quantity or degree and thus violate the earlier rule that statements be definitely true or false. Other clues may come from the way statements are arranged according to the key, such as alternating true and false items, or any other style of placing the items in a systematic and predictable order. This should be avoided because once students notice the pattern, they are not likely to read the items anymore. Instead, they respond to all items mindlessly and still obtain high scores.
Basic in test development is our mindful tracking of our purpose. Binary-choice items
can be a useful tool for assessing learning intents that are drawn from various types of
knowledge, but include only simpler cognitive processes. In this test, students only recall their
understanding of the subject matter covered in assessment domain. They do not manipulate this
knowledge by using more complex, deeper cognitive strategies and processes.
Another important point to consider in deciding whether to use the binary-choice test is its degree of difficulty. Because this type of test offers only two options, the chance that a student chooses the correct option by guessing is 50%; the remainder is the chance of choosing the wrong option. This 50-50 probability of selecting the correct answer is problematic because the chance of answering the question correctly is high even if the student is not quite sure of his understanding. One way of reducing the likelihood of successful guessing, suggested by Popham (2004), is to include more items: even if students are successful in their guesswork on a 10-item binary-choice test, it is far less likely that they can maintain this success with, let us say, a 30-item test.
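To see why lengthening the test curbs blind guessing, the sketch below computes the binomial probability that a student who guesses every binary item reaches a passing mark. The 70% cutoff is only an illustrative assumption, not a standard prescribed here.

```python
# Sketch: chance of "passing" a binary-choice test purely by guessing.
# Each guess has a 0.5 chance of being correct; the 70% cutoff is an
# assumed illustration, not a prescribed standard.
from math import ceil, comb

def prob_pass_by_guessing(n_items, cutoff=0.70, p=0.5):
    """Probability of scoring at or above the cutoff by blind guessing."""
    need = ceil(cutoff * n_items)
    return sum(comb(n_items, k) * p**k * (1 - p)**(n_items - k)
               for k in range(need, n_items + 1))

for n in (10, 30):
    print(f"{n}-item test: P(guess >= 70%) = {prob_pass_by_guessing(n):.4f}")
```

Running the sketch shows that the chance of reaching the cutoff by guessing alone shrinks sharply as the test lengthens, which is the point of the suggestion above.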

Think of a subject matter in your area of specialization, something that you have deep and wide knowledge about.
Think of a competency that can be tested by using a binary-choice type of assessment. Do this by formulating a statement of learning intent.
Convince yourself as to why the binary-choice test can be used to test the competency.

Multiple-choice Items
The multiple-choice test is another selected-response type where students respond to every item by choosing one option among a set of three to five alternatives. The item begins with a stem followed by the options or alternatives. This type of pen-and-paper test has been widely used in national achievement tests and other high-stakes assessments, such as the professional board examinations. Perhaps this is because the multiple-choice test is capable of measuring a range of knowledge and cognitive skills, obviously more than what other types of objective tests can do.
A multiple-choice test may come in two types. The correct-answer type is one whose items pose specific problems with only one correct answer in the list of alternatives. For instance, if a stem is followed by four alternatives, only one of them is correct (the keyed answer), and the other three are incorrect. In this type of multiple-choice test, all items should be designed in this fashion. The other is the best-answer type, where the stem establishes a problem to be answered by choosing the one best option. Understandably, the other options may be acceptable but are not the best alternatives for answering the problem posed in the stem. In this type of multiple-choice test, only one option is the best answer (the keyed answer); the others may all be conditionally acceptable, or some may be acceptable and others totally incorrect.
To guide you in formulating good multiple-choice items, here are some fundamental
guidelines that will be helpful in going through the process.

Make the instructions explicit


When giving a multiple-choice test, the instructions must indicate the content area or context, the way in which students respond to every item, and the scoring. If you are using the correct-answer type, it is helpful to the students if your instructions state that they should "choose the correct answer." Common sense should tell us not to use this expression when our multiple-choice test is of the best-answer type; "choose the best answer" would be more appropriate. Lastly, you may want to say "please" and "thank you."

Formulate a problem
As mentioned above, every item in a multiple-choice test has a stem and a set of alternatives. The stem should clearly formulate a problem. This is to compel students to respond to it by choosing the one option that correctly answers the problem or best addresses it. There are two ways of posing a problem in the stem of a multiple-choice item. One way is by formulating a question or an interrogative statement. The stem "In what year did the first EDSA revolution happen?" poses a problem to be answered more clearly than "The year when the first EDSA revolution happened." The other way to pose a problem in the stem is by formulating an incomplete sentence where one of the options correctly or best completes it. It may be phrased as "The first EDSA revolution happened in the year" and the statement is then followed by the list of alternatives. As you will also learn about completion types in a subsequent section of this chapter, when you use the incomplete-sentence format to pose a problem in the stem of a multiple-choice item, always remove a keyword at the end of the statement, or at least near the end. If the removed keyword is at the end of the statement, do not end the stem with any punctuation mark or a blank space. If the missing keyword is near the end of the statement but is not the last word, replace the removed keyword with an underlined blank space, and end your statement with an appropriate punctuation mark.

State the stem in positive form


Ask yourself: How reasonable is it for you to state your item's stem in a negative form? How important is it to assess students' ability to deal with "negatives" in your test? You will surely struggle to find a good answer that justifies the use of negative statements in your multiple-choice test.
One of the common problems we encounter with a negatively phrased stem is the high chance of not spotting the word that carries the negation (e.g., not). Another is the difficulty of anchoring the negative item to the learning intent. In general, "which one is" works more effectively in assessing students' learning than "which one is not." The rule of thumb says to avoid negative statements unless there is a really compelling reason why you need to phrase your stem in a negative form. If the reason is compelling enough, you need to highlight the word that carries the negation, for example by writing "not" in capital letters, boldface, underlined, or italicized form.

Include only useful alternatives


Remember that the set of alternatives following the stem is a list of options from which
students pick out their response. In any type of multiple-choice test, only one alternative is
keyed, and the rest are distractors. The keyed alternative is ultimately useful because it is what
we expect every student who learned the subject matter should choose. If the set of alternatives
does not contain the expected answer, it is clearly a bad item. This problem is more dreadful in a
correct-answer type than in the best-answer type. At least for the latter, the second best
alternative can stand as the key if the best answer is missing in the list. If the correct answer is
missing in the list of options in a correct-answer type, then there is really no answer to the
problem posed in the stem, and must be removed from the test.
Even if the distractors are not the expected answers, they serve an important function in the multiple-choice test. As distractors, they should distract those students who have not learned the subject matter well enough, but not those who have. Therefore, these distractors should be plausible, that is, they should appear as if they are correct or best options. Plausible distractors work in a multiple-choice test by making students who have not learned enough believe that these distractors are the correct or best answer even when they are not. An important consideration in dealing with the alternatives is maintaining a homogeneous grouping. For example, if a stem asks about the name of a particular inventor in science, all alternatives should be names of scientific inventors.
As stated above, a multiple-choice item should have three to five alternatives. Whether to include 3, 4, or 5 depends on the grade or year level of the class of students you are handling. We suggest that higher grade- or year-level students be given items with more than 3 options, as this will increase the level of test difficulty and reduce the effects of guessing on your assessment of students' learning. In instances when you wish students to evaluate each option as to its plausibility, you may add the option "none of the above" as the fourth or fifth alternative. However, you have to use this alternative with caution. Use it only for the correct-answer type of multiple-choice test, when you intend to increase the difficulty of an item, and when the presence of this option will help you come up with a better inference about your students' learning. Let us say, for example, you are testing the computational skills of your students using multiple-choice items and you encourage mental computation as they deal with each item. If you give them only number options, they may just choose any one option based on simple estimation, believing that one of them is the correct answer. Adding the "none of the above" option will encourage students to do the mental computation to check each option's correctness, because they know it is possible that the correct answer is not in the list. Obviously, you cannot use this option in a best-answer type of multiple-choice test.
The option “all of the above” should never be used at all as this can invite guessing that
will work for your students. If your last option (4th or 5th) is “all of the above” and your students
notice at least 2 options that are correct, they are likely to guess that “all of the above” is the
correct option. Similarly, if they spot one incorrect option, automatically they disregard the “all
of the above” option. When they do so, the item’s difficulty is reduced.
One of the instances that teachers are tempted to unreasonably use “none of the above” or
include “all of the above” option even if it is not allowed, is when they force themselves to
maintain the same number of alternatives for their multiple-choice items. In this case, they use
these alternatives as “fillers” in case they run out of options to maintain its number in all the
items. In order to avoid this mistake, it is important to realize that, for classroom testing
purposes, multiple-choice items do not have to come with the same number of options for all its
items. It is okay to have some items with four options while some other items have five.

Scatter the positions of keyed answers


In formulating your multiple-choice items, spread the keyed answers to different response
positions (i.e., a, b, c, d, and e). Make sure the number of items whose keyed answer is “a” is
proportional to the items keyed for each of the other response positions. Better yet, if you give a
20-item multiple-choice test with 4 options per item, key five items to each response position
(25% of items per response position or approximately so).
The good thing about the multiple-choice test is that it is capable of measuring skills higher than just recall or simple comprehension. If properly formulated, the test can measure higher-level thinking (Airasian, 2000). Also, the fact that every item in a multiple-choice test is followed by more than two response options gives it its reputation for a higher difficulty level, because the probability of guessing the correct option becomes smaller as you increase the number of options. Certainly, a 4-option item is more difficult for a guessing student than a 3-option item because the former gives only a 25% probability of guessing the correct option, which is lower than the roughly 33% probability for a 3-option item. A 5-option item is clearly more difficult still.
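The same binomial logic used for binary-choice items applies here. The sketch below compares the per-item guessing probability and the chance of reaching an assumed 70 percent cutoff on a hypothetical 20-item test for 3-, 4-, and 5-option items; both the test length and the cutoff are illustrative assumptions.

```python
# Sketch: how the number of options changes the odds of success by guessing.
# The 20-item length and 70% cutoff are illustrative assumptions.
from math import ceil, comb

def prob_pass_by_guessing(n_items, p, cutoff=0.70):
    """Probability of reaching the cutoff when every item is guessed."""
    need = ceil(cutoff * n_items)
    return sum(comb(n_items, k) * p**k * (1 - p)**(n_items - k)
               for k in range(need, n_items + 1))

for options in (3, 4, 5):
    p = 1 / options
    print(f"{options} options: per-item guess chance = {p:.0%}, "
          f"P(guess >= 70% on 20 items) = {prob_pass_by_guessing(20, p):.6f}")
```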

A. Think of a specific subject matter in your field of specialization, one that you are very
familiar with.
B. Write a learning intent that can be measured using multiple-choice test.
C. Formulate at least 5 correct-answer-type items and another 5 best-answer-type items.
D. Check the quality of your output based on the guidelines discussed above. As you do this,
monitor your learning as well as your confusions, doubts or questions.
E. Raise questions in class.

Matching Items
Another common type of selected-response test is the matching type, which comes with two parallel lists (i.e., premises and responses), where students match the entries on one list with those in the other list. The first list consists of descriptions (words or phrases), each of which serves as the premise of a test item. Therefore, each premise is taken as a test item and must be numbered accordingly. Each premise will be matched with the entries in the second (or response) list. There is only one and the same response list for all the premises in the first list.
In developing good matching items, it is helpful to consider the following hints that will
guide you in the process of designing your lists.

Make instructions explicit


In making your instructions for a matching test explicit, the context, task, and scoring must be clearly indicated. For the context, your instructions must introduce the description list as well as the response list. If, for example, your description list contains premises about scientific inventions, you must state in your instructions that the first list (or first column) is about scientific inventions. If your response list contains names of scientific inventors, you must also state in your instructions that the second list (or second column) contains names of scientific inventors. You may phrase it something like this: "In the first column are scientific inventions. The second column lists names of scientific inventors. Match the inventions with their inventors." Then indicate the scoring. Having said this, we suggest that your lists be labeled with headings accordingly. In the case of the above example, you may write the column heading as "Inventions" or "Column A: Inventions" for the first column or description list, and "Inventors" or "Column B: Inventors" for the second column or response list.

Maintain brevity and homogeneity of the lists


The list of premises or descriptions must be fairly short; that is, include only those items that go together as a group. For example, if your matching test covers the common laboratory operations in chemistry, choose only those that are relevant to your assessment domain. In doing this, you are also maintaining homogeneity of your list. In matching tests, it is extremely important that entries in the description list are drawn from one relatively specific assessment domain. For example, never mix up common laboratory operations with measurements. Instead, decide whether you will include only one of these. The same is true for your response list. Include only those that belong to the assessment domain. Note that homogeneity in your lists is non-negotiable.
Also, in writing good matching items, it is imperative that the descriptions are longer than the responses, not the other way around. After a student reads one of the descriptions, he or she reads all options in the response list. If the description is longer than each of the options, the student at least reads that long description only once or twice. If the entries in the response list are long, it will take more time for the student to read all options just to respond to one description or item.

Finally, include more options than descriptions. If your description list has 10 descriptions or items, make your responses 12 or a bit more. This strategy reduces the effect of response elimination, where the student disregards those options already chosen to match the other descriptions. For example, suppose the student has already responded to 8 out of 10 descriptions, is highly confident of these responses, but finds the last 2 items difficult. With only 10 options, only 2 options remain available, and therefore each remaining option has a 50% probability of being the correct one. If you include more than 10 responses, more options remain for the last 2 descriptions, and the probability that any one of them is correct is smaller than 50%. This will reduce the effect of guessing, as the computation after this paragraph illustrates. Better yet, formulate your descriptions in a way that allows some options to be used more than once. In this case, you maintain the plausibility of all options for every description.
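The following quick computation illustrates the effect of the extra options described above: with all but two descriptions answered, it compares the chance of guessing both remaining matches correctly when the response list has exactly ten options versus twelve, assuming each option may be used only once.

```python
# Sketch: chance of guessing the last two matches correctly on a
# 10-description matching test where options are used only once.
def prob_guess_remaining(total_options, used_options, remaining_items=2):
    """Probability of guessing all remaining matches right, drawing without replacement."""
    unused = total_options - used_options
    p = 1.0
    for i in range(remaining_items):
        p *= 1 / (unused - i)
    return p

print("10 options:", prob_guess_remaining(10, 8))   # 1/2 * 1/1  = 0.50
print("12 options:", prob_guess_remaining(12, 8))   # 1/4 * 1/3 ~ 0.083
```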

Keep the options plausible for each description


Because there is only one and the same list of options for all of the descriptions, it is vital that you keep the options plausible for every description. This means that if you have ten descriptions and twelve options, one option is keyed for each description and the other eleven should be plausible distractors. Usually, if the rule on homogeneity is very well observed, it is relatively easy to maintain one list of plausible options for every description. In addition, never establish a systematic sequence of keyed responses, such as coding with a word like G-O-L-D-E-N-S-T-A-R, which means that the keyed response letter for the first description is "G" and the keyed response for the 10th description is "R." If this pattern is even partly detected by the students, such as G-O-L- _ -E- _ -S-T- _ -R, they immediately jump to guessing that the missing letters are D, N, and A, respectively (and guess it right).

Place the whole test on the same page of the test paper
After stating the instructions for a matching test, write the lists or columns below them and make sure all descriptions and options are written on the same page of the test paper where the matching test is placed. Never extend some items or options onto the next page of the test paper because, if you do so, students will keep flipping between pages as they respond to your matching items. If you notice in your draft that some items already run onto the next page, make some simple adjustments, such as reducing the font size of your items as long as it remains legible, or improving the efficiency of your test layout. If the problem still exists, shorten your list, or, if there are other types of test in your test paper, switch your matching test with another test.
The use of selected-response tests is effective for various types of learning intents and assessment contexts. With careful design, these tests can measure capabilities beyond the lower-order kinds, especially if the items are formulated to elicit students' higher levels of cognitive skills (Popham, 2004).

Lesson 3
Designing Constructed-Response Types

Another set of options for the types of pen-and-paper test to give is the constructed-response test. Unlike the selected-response types, the constructed-response test does not provide students with options for answers but rather requires students to produce and give a relevant answer to every test item. As its name suggests, in this type of test students construct their response instead of just choosing it from a given list of alternatives.
Constructed-response methods of assessment include certain types of pen-and-paper tests
and performance-based assessments. In this chapter we focus our discussion only on constructed-
response types of pen-and-paper test. Some of the common types of pen-and-paper constructed-
response test are short-answer and essay.

Short-answer Items
As the name suggests, short-answer items ask students to provide short answers to questions or descriptions. This type of constructed-response test calls for students to respond to a direct question, a specific description, or an incomplete sentence by supplying a word, a phrase, or a sentence. If a test contains direct questions, students are expected to answer each question by giving the word, symbol, number, phrase, or sentence being asked for. The same applies to items using specific descriptions of words, phrases, or sentences. Items composed of incomplete sentences ask students to complete every sentence by supplying the word or phrase that meaningfully completes it in terms of the assessment domain.
In formulating the questions or descriptions that compose your test items, it is important to always think according to the name of this test type, so that you are mindful that the items should call for "short answers." Do not ask questions that require a long answer; otherwise you are using short-answer items as essay items. If your assessment target calls for students to respond with longer statements or written discussions, it is preferable that you give essay items instead of short-answer items.

Make instructions explicit


Short-answer items usually have simple instructions. In fact, it is tempting to just expect that students understand how to go about the test using only their common sense. However, it is always unsafe to assume that every student understands what you want them to do with your test. Besides, it is always advisable that you give your students the necessary prompt before they respond to the test items. In short, you need to set clear instructions even for short-answer items, which should indicate the content area, the task, and the scoring. In directing students on the task you expect them to do, specify whether they are to answer the question, indicate what is described, or complete the sentences, depending on your item's format. Lastly, remember to say "please" and "thank you" in your instructions.

Decide on the item’s format


When you decide to use short-answer items, also decide whether all your items should come as questions, descriptions, or incomplete sentences. Whichever format you decide to use, maintain its consistency across all your short-answer items. For example, if you wish to give a 15-item short-answer test and expect students to supply short answers to your questions, have all the items of the test written in direct question form. Never mix direct-question items with descriptions or incomplete sentences. One important criterion for choosing what format to use is the age of the students. For younger learners, it is usually preferable to use direct questions rather than descriptions or incomplete sentences. Once you have made up your mind as to the item format, work your way through formulating each item.

Structure the items accordingly


Because short-answer items call for a "short answer," as may be inferred from the name, always make sure you structure every item so that it requires only a brief answer (i.e., a word, a symbol, a number [or a set of numbers], a phrase, or a short sentence). This is achieved by formulating very clear, specific, explicit, and error-free statements in your items. A clear and specific question calls for a specific answer. If your description clearly and explicitly represents the object being described, and you are sure that it refers to a specific word, symbol, or phrase, then your item is structured properly. If your items are incomplete sentences, structure every item so that the missing word or phrase is a keyword or a key idea. Ordinarily, an incomplete sentence has only one blank, which corresponds to one missing keyword. You may want to remove 2 keywords as long as doing so does not distort the key idea of the incomplete sentence, which should guide the students in figuring out the missing words. Never go beyond 2 blanks.
One important reason why we need to ensure that students supply only brief responses is to make sure that responses are easy to check objectively. We encounter a major problem related to scoring if students' responses are lengthy. With long responses, it is difficult to give accurate scores. Of course, we already know, as discussed in Chapter 3, that inaccurate scoring of students' responses undermines the reliability of our measures and reduces the validity of the inferences we make about our students' learning outcomes.

Provide the blanks in appropriate places
Blanks are spaces in the items where students supply their answers by writing a word, a symbol, a number, a phrase, or a sentence. If your items are all in a direct question format where each question begins with an item number, place the blank on the left side of the item number. When you type the item, begin with the blank space, followed by the item number, then the question. This rule also applies to items using explicit descriptions. If you are using the incomplete-sentence format for your items, place the blank near the end of the sentence. This means that you take out a keyword found near the end of the sentence so that it becomes an incomplete sentence. Never take out a keyword from the beginning of a sentence. The reason for this is that you need to first establish the key idea of the sentence so that students immediately know what is missing right after one reading. If the blank space is near the beginning of the sentence, students will find it hard to grasp the key idea and will therefore read the sentence more than once in order to figure out the missing word. In all item formats, always keep the blanks the same length in all your short-answer items.
The good thing about short-answer items is that students really produce a correct answer rather than merely select one from a set of given alternatives. In this case, students who possess only partial knowledge of the subject matter, which usually works for them with selected-response items, will find it difficult to give a correct response to every short-answer item. Although we generally recognize that these types of items are appropriate for measuring simple kinds of learning outcomes, they are capable of measuring various types of challenging outcomes if the items are carefully developed. However, it is not advisable to force yourself to use short-answer items to measure more complex and deeper levels of cognitive processes. It is always helpful to know other methods of assessment so that you have a wide range of options within which you can freely navigate, depending on your assessment purposes.

A. Think of a specific subject matter in your field of specialization, one that you are very familiar with.
B. Write a learning intent that can be measured using a short-answer test.
C. Formulate at least 5 items using one of the suggested formats.
D. Check the quality of your output based on the guidelines discussed above. As you do this, monitor your learning as well as your confusions, doubts, or questions.
E. Raise questions in class.

Essay Items
Relative to our learning intents, there are times when it is necessary for our students to supply lengthy responses so that they can exhibit more complex cognitive processes. For some learning targets, a single word, a phrase, or a sentence is not enough to measure students' learning outcomes. For these targets, we need a constructed-response type of test that will allow students to adequately exhibit their learning through sufficient writing; hence, essay items work for these purposes.

Just like short-answer items, essay items call for students to produce rather than select
answers from given alternatives. But unlike short-answer items, essay items call for a more substantial, usually lengthy response from students. Because the length and complexity of the response may vary, essay items are appropriate measures of higher-level cognitive skills.
Following are some guidelines that will help you formulate good essay items.

Communicate the extensiveness of expected response


By reading the essay item, your students must know exactly how brief or extensive their responses should be. This is made possible by making your item clearly convey the degree of extensiveness you expect of their response. Extensiveness depends on the degree of complexity of your item. To determine the degree of complexity you desire to assess, you may design an essay item according to either of two types, depending, of course, on your assessment objective: the restricted-response and the extended-response item. If you wish to measure students' ability to understand, analyze, or apply certain concepts to new contexts while dealing with relatively simple dimensions of knowledge, and if the task requires only a relatively short time period, the restricted-response type may be preferred. If, however, you wish to assess students' capability to evaluate or synthesize various aspects of knowledge, which will naturally require a longer time for their responses to be completed, the extended-response type is preferable. Notice that even at this phase of determining the degree of complexity of your essay item, it is vital that you make a clear decision based on your learning intent. This phase is crucial because if you design an essay item of the extended type but give it to your students as if it were of the restricted type, your students' failure to meet the assessment standards set for the item may not be due to their level of learning, but rather because they needed more time to gather and process information before they could come up with responses that are relevant to your assessment standards. Your inference about students' learning becomes problematically unreliable and invalid. Your inference becomes equally problematic if you construct a truly restricted type of essay item but give it as if it were an extended-type essay item.

Prime and prompt students through the item


Unlike the other types of pen-and-paper tests, an essay item already includes the context, the assessment task, and the assessment standards altogether. The statement of context provides a background of the subject matter in focus and primes the students' thinking about that subject matter. The prime helps students to be selectively attentive to the subject matter that is relevant to the assessment task of the essay item. Without it, students tend to grapple with understanding the subject matter that is embedded in the statement of the assessment task, and may find it difficult to stay in focus. The assessment task is what the students directly respond to in order to write an essay. Both the statement of context (or the prime) and the assessment task (or the prompt) are important in setting the students' attention to the subject matter and in making them think of a response that meets the assessment standards. Notice, for example, that if the item is phrased as "Compare and contrast the governance of Estrada and Arroyo," students first struggle to generate some ideas related to these two names, then think of the governance or political administrations of the two Philippine presidents in a general sense. This is because the item does not have a prime. In this case, the item is not helping the student stay focused on what the item really intends to assess. It would be different if the item began with a prime, phrased something like, "Our country has been run by a number of presidents already, and along with the change in political administration are the changes in the agenda of reforms. Compare and contrast the economic reform agenda of the presidential administrations of Estrada and Arroyo." In this item, students are primed to think of the reform agenda of the two presidents, so it is very probable that they focus more on the context as they respond to the assessment task. This latter example is not yet a complete essay item as it lacks other necessary elements, but it clearly shows how effectively you can prime and prompt students to respond appropriately to your test item. This item may be improved into a full-blown essay item if you add the other elements, such as the guide to the extensiveness of the desired response as well as the assessment standards.

Provide clear assessment standards


You might think that, if it has both a prime and a prompt, your item can already stand. This is not true. For an essay item to stand as a good one, it must also indicate a clear guide to the value of the item. The assessment standards inform the students about which specific aspects of their responses you will give merit to, and which aspects will earn more credit than others. If, for example, you give credit to their argument when they can provide evidence, then you need to categorically ask for it in your essay item. Similarly, if you give two or three essay items and you wish to give more credit to one item based on its complexity, you also need to indicate that item's value. This way, students know when and where to devote most of their time and effort, and can decide how much of these resources to invest in each item. One simple way of guiding students in terms of an item's value is to indicate the assessment weight you assign to the item in parentheses at the end of the item.

Do away with optional items


While reading this part, you may be recalling a common experience of taking an essay test in which the teacher asked you to choose some, but not all, of the essay items to answer, and you tended to choose those items that were more convenient to your understanding and readiness. This practice of providing optionality in essay items, where students are made to answer fewer items than are presented, should be stopped. From what you may recall of your experience, it is obvious that, when students are free to choose only a few items to answer, they will choose those items that are easy for them. As a consequence, each student will be choosing items that are "easy" for them, and this leads to flawed inferences about students' learning because students' responses are marked under different standards and levels of complexity, depending on the items they chose to answer. One of the basic questions you will need to answer if you plan to do this is: What is the assurance that all your items have an equal level of complexity and that they measure exactly the same knowledge domains and cognitive processes? This question is extremely difficult to answer. This guideline, therefore, says that if, for example, you have 3 essay items for the test you are about to administer, have each of your students answer all 3 items.

Prepare a scoring rubric


Because an essay item calls for relatively extensive response from the students, it is
always necessary that you prepare a scoring rubric or guide prior to giving the test. The scoring
scheme will help you pre-assess the validity and reliability of your item because it will allow you
to identify the criteria as well as the key ideas you expect your students to give in response to the
item. Your scoring rubric indicates the descriptions in scoring the quality of your students’
responses in the essay item. It includes a set of standards that define what is expected in a
learning situation, and important indicators of how students’ responses to the task will be scored.
Having said this, we ask you to choose the scoring approach that will best fit your assessment
context. You have two options for this purpose. One is the holistic approach, another is the
analytic approach.
The holistic approach allows you to focus on your students' overall response to an essay item. As you assess the response as a whole, this approach guides you in terms of which dimensions of the learning outcome you pay attention to. For example, if your essay item intends to let students manifest their ability to argue with appropriate evidence, and to explain in good, clear, and coherent language, you need to identify the dimensions that capture those abilities in your assessment; hence you may have the following dimensions indicated in your holistic rubric: logic of the argument, relevance of evidence, communicative clarity, lexical choice, and mechanics (spelling, punctuation, etc.). These dimensions serve as your criteria for assessing students' responses. It is always appropriate to indicate the dimensions to assess because these dimensions keep you focused as you assign a score to each student's response. To guide you further on how much score to give, each dimension must be assigned a corresponding point or set of points. If, for example, you wish to give a maximum of 6 points for the logic of the argument, indicate this in your holistic rubric so that it looks like the box below.

Assessment Criteria                                  Points

• Logic of the argument                                6
• Relevance of evidence                                4
• Communicative clarity                                3
• Lexical choice                                       2
• Mechanics (spelling, punctuation, etc.)              2

Another way of setting a guide to scoring is to assign the same points to each criterion but also indicate the weight of each criterion based on its importance or value. The box below gives you a view of how the contents may look.

Assessment Criteria                                  Points    Weight

• Logic of the argument                                5        40%
• Relevance of evidence                                5        35%
• Communicative clarity                                5        15%
• Lexical choice                                       5         5%
• Mechanics (spelling, punctuation, etc.)              5         5%
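To make the arithmetic of such a weighted guide concrete, here is a minimal sketch in Python. The criteria, maximum points, and weights mirror the sample box above, while the function, the student's raw scores, and the 100-point total are hypothetical illustrations rather than a prescribed procedure.

# A minimal sketch of computing a weighted holistic score (hypothetical example).
criteria = {
    # criterion: (maximum points, weight)
    "Logic of the argument": (5, 0.40),
    "Relevance of evidence": (5, 0.35),
    "Communicative clarity": (5, 0.15),
    "Lexical choice":        (5, 0.05),
    "Mechanics":             (5, 0.05),
}

def weighted_score(raw_scores, total=100):
    """Convert raw criterion scores into a single weighted score out of `total`."""
    score = 0.0
    for criterion, (max_pts, weight) in criteria.items():
        score += (raw_scores[criterion] / max_pts) * weight * total
    return round(score, 1)

# One student's essay rated out of 5 points on each criterion.
student = {"Logic of the argument": 4, "Relevance of evidence": 3,
           "Communicative clarity": 5, "Lexical choice": 4, "Mechanics": 5}
print(weighted_score(student))  # 77.0 out of 100

The same weights can, of course, be applied by hand; the sketch only shows that the weighted total is the sum of each criterion's proportion of its maximum points multiplied by its weight.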

When employing a holistic approach to scoring students' responses to an essay item, your decision on how much score to give for each dimension is not guided by explicit descriptions of the quality of response; it usually rests on the teacher's judgment of the student's response on each criterion. Because this approach does not require specific descriptions of the quality of response, it is easy and efficient to use. Its major weakness, however, is that it does not specify graded levels of performance quality, which invites subjective judgment of students' responses. Acknowledging this weakness, we recommend that you use the holistic approach only for restricted-response items, where students are tested on less complex skills requiring only a small amount of time.
In contrast, the analytic approach allows for a more detailed and specific assessment scheme in that it indicates not only the dimensions or criteria, but also specific descriptions of the different levels of performance quality for each criterion. Suppose we take the sample criteria in the boxes above and use them as the criteria for our analytic rubric; we proceed by determining the levels of performance quality for each criterion. For the logic of the argument criterion, we set a scale of varying performance quality, perhaps ranging from Excellent to Poor, with other levels of quality in between. A simple way to do this is exemplified in the box below.

Assessment Criteria                          Scale Indicators

                                   Excellent   Satisfactory   Fair      Poor
                                   (8 pts)     (6 pts)        (4 pts)   (2 pts)
• Logic of the argument (40%)        ____        ____          ____      ____
• Relevance of evidence (35%)        ____        ____          ____      ____
• Communicative clarity (15%)        ____        ____          ____      ____
• Lexical choice (5%)                ____        ____          ____      ____
• Mechanics (5%)                     ____        ____          ____      ____

As indicated in the box above, there are 4 scale indicators, each representing a level of
performance quality. In this example, the teacher will put a check on the space below the scale
indicator that matches the quality of a student’s response on every criterion. Scores are obtained
by assigning points in every scale indicator. You may also specify the weight of each criterion
depending on the degree of importance or value of the criterion.
A more calibrated analytic rubric not only indicates the scale levels for the teacher to check against the quality of students' responses to an essay item, but also describes the performance quality that falls under each level of the scale. This rubric describes what quality of performance qualifies as "excellent" and what type of performance is "poor." In this case, the analytic rubric should include descriptive statements for each scale level of each criterion. The table below shows an example of these descriptive statements applied to one of the criteria used in the earlier example, just to illustrate the point.

Criterion: Logic of the argument

Excellent (7-8 points): Argument is clearly premised on valid assumptions and is logically sequenced.
Satisfactory (5-6 points): Argument is premised on valid assumptions, with logical sequence only in some parts.
Fair (3-4 points): Some assumptions are weak and the argument is not completely logical.
Poor (1-2 points): Assumptions are generally too weak and the argument is problematic.
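To show how such a calibrated rubric can be kept organized, here is a minimal sketch in Python; the level descriptors are abbreviated from the table above, and the data structure and function are hypothetical illustrations, not a required format.

# A minimal sketch of an analytic rubric as a data structure.
# Other criteria (relevance of evidence, clarity, etc.) would be added the same way.
analytic_rubric = {
    "Logic of the argument": [
        ("Excellent",    (7, 8), "Valid assumptions, logically sequenced throughout"),
        ("Satisfactory", (5, 6), "Valid assumptions, logical sequence only in parts"),
        ("Fair",         (3, 4), "Some weak assumptions, not completely logical"),
        ("Poor",         (1, 2), "Assumptions too weak, argument problematic"),
    ],
}

def level_for(criterion, points):
    """Return the performance level and descriptor that a point award falls under."""
    for level, (low, high), descriptor in analytic_rubric[criterion]:
        if low <= points <= high:
            return level, descriptor
    raise ValueError("Points fall outside the rubric's range")

print(level_for("Logic of the argument", 6))
# ('Satisfactory', 'Valid assumptions, logical sequence only in parts')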

The good thing about using the analytic approach in scoring essay responses is that it helps you identify the specific level of students' performance and makes your assessment of students' learning outcomes more objective. It therefore increases the reliability of your measure and supports more valid inferences. It is also beneficial for the students because, through the analytic rubric, they can pinpoint the specific level of their performance and judge its quality by matching it against the descriptions. This type of rubric is best for essay items that measure more complex cognitive skills and more sophisticated knowledge dimensions.
Whichever approach you wish to use for scoring your students' responses to your essay items, your decision will work if you are already clear on the following questions:

• What do you want your students to know and be able to do in the essay?
• How well do you want your students to know and be able to do it in the essay?
• How will you know when your students know it and do it well in the essay?

As you clarify your practice with reference to those questions, work your way toward constructing your scoring scheme using either approach, following the simple steps indicated below.
• Set an appropriate assessment target.
• Decide on the type of rubric to use.
• Identify the dimensions of performance that reflect the learning outcomes.
• Weigh the dimensions in proportion to their importance or value.
• Determine the points (or range of points) to be allocated to each level of performance.
• Share the rubric with colleagues and/or students before using it.

Some teachers are excited to use essay items because these items provide more opportunities to assess various types of learning outcomes, particularly those that involve higher-level cognitive processing. If carefully constructed, essay items can test students' ability to logically arrange concepts and analyze relationships between them; state assumptions or compare positions, evaluate them, and draw conclusions; formulate hypotheses and argue for causal relationships among concepts; organize information or bring in evidence to support findings; and propose solutions to problems and evaluate those solutions in light of certain criteria. These and many more competencies can be measured using good essay items.


Chapter 5
Constructing Non-Cognitive Measures

Objectives

1. Follow the procedures in constructing an affective measure.


2. Determine techniques in writing items for non-cognitive measures.
3. Use the appropriate response format for a scale constructed.
4. Give the uses of non-cognitive tests.

Lessons

1 What are non-cognitive constructs?


2 Steps in constructing non-cognitive measures
3 Response Formats

Lesson 1
What are Non-Cognitive Constructs?

Human behavior is composed of multiple dimensions. Behaviors are the characteristic ways in which people think, feel, and act as they interact with their environment. The previous sections emphasized techniques for assessing the cognitive domain as applied in creating teacher-made tests and in analyzing tests using either the Classical Test Theory or the Item Response Theory approach. This chapter guides you in the construction of measures in the affective domain. Anderson (1981) explained affective characteristics as "qualities which presents people's typical ways of feeling, or expressing their emotions" (p. 3). Sta. Maria and Magno (2007) found that affective characteristics run on two dimensions: intensity and direction. Intensity refers to the strength of the characteristic expressed. The direction of affect refers to the source of the affect, ranging from external (object) factors to person factors. Intensity is reflected, for example, in high scores on affective measures such as aggression scales and motivation scales. Direction refers to the cause of the characteristic: for aggression, whether it comes from an external person or from the self; for motivation, whether the cause is internal (ability) or material, such as rewards.

Figure 1
Dimensions of Affect
[Diagram: a vertical axis running from Low Intensity to High Intensity, crossed by a horizontal axis running from Person to Object.]

Affective characteristics are further classified into specific variables such as attitudes, beliefs, interests, values, and dispositions.

Attitude. Attitudes are learned predispositions to respond in a consistently favorable or unfavorable manner with respect to a given object (Meece et al., 1982). According to Meece et al. (1982), attitude is related to academic achievement because attitudes are learned over time through contact with the subject area. Information about the subject area is received through instruction, and consequently an attitude is developed. Moreover, if a person is favorably predisposed toward an academic course, that favorable disposition should lead to favorable behaviors such as achievement.
According to Bandura (1977), attitude is often discussed in conjunction with the motivation to achieve: how capable people judge themselves to be of performing a task successfully. Moreover, extensive evidence and documentation support the conclusion that attitude is a key factor in the extent to which people can bring about significant outcomes in their lives.
According to Overmier and Lawry (1979), one potential source of the drive to perform is the incentive value of the performance. Incentive theories of motivation suggest that
people will perform an act when its performance is likely to result in some outcome they desire,
or that is important to them. For example, in anticipation of a situation in which a person is
required to perform, that person may expend considerable effort in preparation because of the
mediation provided by the desire to achieve success or avoid failure. That desire would be said to
provide incentive motivation for the person to expend the effort. Accordingly, a test, as a
stimulus situation, may be theorized to provoke students to study as a response, because of the
mediation of the desire to achieve success or avoid failure on that test. Studying for the test,
therefore, would be the result of incentive motivation.
In more objective terms, attitude may be said to connote response consistency with regard to certain categories of stimuli (Anastasi, 1990). In actual practice, attitude has been most frequently associated with social stimuli and with emotionally toned responses (Anastasi, 1990).
Zimbardo and Leippe (1991) defined attitudes as favorable or unfavorable evaluative reactions, whether exhibited in beliefs, feelings, or inclinations to act toward something. According to Myers (1996), attitude commonly refers to beliefs and feelings related to a person or event and the resulting behavior. Attitudes are an efficient way to size up the world: when individuals have to respond quickly to something, the feeling can guide the way they react. Psychologists agree that knowing people's attitudes helps predict their actions. Attitudes involve evaluations. An attitude is an association between an object and our evaluation of it. When this association is strong, the attitude becomes accessible; encountering the object calls up the associated evaluation. Attitudes are acquired in ways that make them sometimes potent, sometimes not. An extensive series of experiments shows that when attitudes arise from experience, they are far more likely to endure and to guide actions. Attitudes predict actions when other influences are minimized, when the attitude is specific to the action, and when it is potent.
An example of an attitude scale is the "Attitude Towards Church Scale" by Thurstone and Chave (1929). The scale measures the respondent's position on a continuum ranging from strong depreciation to strong appreciation of the church. It is composed of 45 items. A split-half reliability of .848 was obtained, which became .92 when corrected with the Spearman-Brown formula (a worked computation is shown after the sample items below). Evidence of discriminant validity was gathered by classifying participants according to their religion, with the Catholic group obtaining the highest mean score; in another analysis, participants who frequently attended church had the highest mean. Examples of items are:

1. I think the teaching of the church is altogether too superficial to have much social significance.
2. I feel the church services give me inspiration and help me to live up to my best during the following week.
3. I think the church keeps business and politics up to a higher standard than they would otherwise tend to maintain.
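As referenced above, the Spearman-Brown correction of a split-half coefficient to full test length is a standard formula; the short sketch below (Python, with the values taken from the figures reported above) simply reproduces the arithmetic.

# Spearman-Brown correction of a split-half reliability to full test length.
split_half = 0.848
corrected = (2 * split_half) / (1 + split_half)
print(round(corrected, 2))  # 0.92, matching the reported coefficient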

Beliefs. Beliefs are judgments and evaluations that we make about ourselves, about
others, and about the world around us (Dilts, 1999). Beliefs are generalizations about things such
as causality or the meaning of specific actions (Pajares, 1992). Examples of belief statements
made in the educational environment are “A quiet classroom is conducive to learning,”
“Studying longer will improve a student’s score on the test,” “Grades encourage students to work
harder.”
Beliefs play an important part in how teachers organize knowledge and information and
are essential in helping teachers adapt, understand, and make sense of themselves and their world
(Schommer, 1990; Taylor, 2003; Taylor & Caldarelli, 2004). How and what teachers believe
have a tremendous impact on their behavior in the classroom (Pajares, 1992; Richardson, 1996).
An example of a measure of belief is the Schommer Epistemological Questionnaire.
Schommer (1990) developed this questionnaire to assess beliefs about knowledge and learning.
A 21-item questionnaire was developed by the researchers to measure epistemological beliefs of
Asian students. The questionnaire was adapted from Schommer's 63-item epistemological beliefs
questionnaire. This Asian version of the Schommer Epistemological Questionnaire has been
validated with a sample of 285 Filipino college students. This epistemological questionnaire was
revised to have fewer items and simpler expression of ideas to be more appropriate for Asian learners. The number of statements was reduced so that participants would not be placed under stress while completing the questionnaire. Students are asked to rate their
degree of agreement for each item on a 5-point Likert scale ranging from 1 (strongly disagree) to
5 (strongly agree). Wording of items varied in voice from first person (I) to third person
(students) in an effort to illustrate how the same belief could be queried from somewhat different
perspectives. Items assessed four epistemological belief factors including beliefs about the
ability to learn (ranging from fixed at birth to improvable), structure of knowledge (ranging from
isolated pieces to integrated concepts), speed of learning (ranging from quick learning to gradual
learning), and stability of knowledge (ranging from certain knowledge to changing knowledge).
Schommer (1990) has reported reliability and validity testing for the Epistemological
Questionnaire; the instrument reliably measures adolescents' and adults' epistemological beliefs
and yields a four-factor model of epistemology. Schommer (1993) has reported test-retest
reliability of .74. Factor analyses were conducted on the mean for each subset, rather than at the
item level.

Interests. Interest generally refers to an individual's strengths, needs, and preferences. Knowledge of one's interests strengthens understanding of career decision making and overall development. Strong (1955) defined interests as "a liking/disliking state of mind accompanying the doing of an activity" (p. 138). Interests may also be regarded as instrumental means to an end, independent of perceived importance (Savickas, 1999).
According to Holland’s theory, there are six vocational interest types. Each of these six
types and their accompanying definitions are presented below:

Realistic. People with Realistic interests like work activities that include practical, hands-on problems and solutions. They enjoy dealing with plants, animals, and real-world materials like wood, tools, and machinery. They enjoy outside work. Often people with Realistic interests do not like occupations that mainly involve doing paperwork or working closely with others.

Investigative. People with Investigative interests like work activities that have to do with ideas and thinking more than with physical activity. They like to search for facts and figure out problems mentally rather than to persuade or lead people.

Artistic. People with Artistic interests like work activities that deal with the artistic side of things, such as forms, designs, and patterns. They like self-expression in their work. They prefer settings where work can be done without following a clear set of rules.

Social. People with Social interests like work activities that assist others and promote learning and personal development. They prefer to communicate more than to work with objects, machines, or data. They like to teach, to give advice, to help, or otherwise be of service to people.

Enterprising. People with Enterprising interests like work activities that have to do with starting up and carrying out projects, especially business ventures. They like persuading and leading people and making decisions. They like taking risks for profit. These people prefer action rather than thought.

Conventional. People with Conventional interests like work activities that follow set procedures and routines. They prefer working with data and detail rather than with ideas. They prefer work in which there are precise standards rather than work in which you have to judge things by yourself. These people like working where the lines of authority are clear.

Examples of affective measures of interest are the Strong-Campbell Interest Inventory and the Strong Interest Inventory (SII), the Jackson Vocational Interest Inventory, the Guilford-Zimmerman Interest Inventory, and the Kuder Occupational Interest Survey. For a list of vocational interest tests, visit the site: http://www.yorku.ca/psycentr/tests/voc.html.

Values. Values refer to "the principles and fundamental convictions which act as general guides to behavior, the standards by which particular actions are judged to be good or desirable" (Halstead & Taylor, 2000, p. 169). Values are used as guiding principles to act and to justify actions accordingly (Knafo & Schwartz, 2003). Values are internalized and learned at an early stage in life. The school setting is one major avenue where people show how values are learned, respected, and upheld. A student who values education in school is provided with opportunities to behave in ways that allow him or her to do well in school, and thus develops the values of hard work, perseverance, and diligence in academic-related tasks. Examples of values are diligence, respect for authority, emotional restraint, filial piety, and humility.

An example of a measure of values is the Asian Values Scale-Revised (AVS-R). The AVS-R is a 25-item instrument designed to measure an individual's adherence to Asian cultural values, that is, the enculturation process and the maintenance of one's native cultural values and beliefs (Kim & Hong, 2004). In particular, the AVS-R assesses dimensions of Asian cultural values, which include "collectivism, conformity to norms, respect for authority figures, emotional restraint, filial piety, hierarchical family structure, and humility." The instrument uses a 4-point Likert scale ranging from 1 (strongly disagree) to 4 (strongly agree). A high score indicates a high level of adherence to Asian values, while a low score indicates otherwise. Factors
included are high expectations for achievement (e.g. One need not minimize or depreciate one’s
own achievement – reverse worded), hierarchical family structure (e.g. One need not follow the
role expectations of one’s family- reverse worded), respect for education (e.g. Educational and
career achievements do not need to be one’s top priority – reverse worded), perseverance and
hard work (e.g. One need not focus all energies on one’s studies- reverse worded), filial piety
(e.g. One should avoid bringing displeasure to one’s ancestors), respect for authority (e.g.
Younger persons should be able to confront their elders- reverse worded), emotional restraint
(e.g. One should have sufficient inner resources to resolve emotional problems), and finally,
collectivism (e.g. One should think of one's group before himself). The AVS-R has a reliability of .80 and internal consistency coefficients of .81 and .82. Apart from this, a 2-week test-retest reliability of .83 was obtained. Construct validity was established by identifying the values through a nationwide survey and focus group discussions, whereas concurrent validity was obtained through confirmatory factor analyses, in which a factor structure comprising the AVS, the Individualism-Collectivism Scale (Triandis, 1995), and the Suinn-Lew Asian Self-Identity Acculturation Scale (SL-ASIA; Suinn, Rickard-Figueroa, Lew, & Vigil, 1987) was confirmed. Discriminant validity was evidenced in a low correlation between the AVS scores, which reflect values enculturation, and the SL-ASIA scores, which reflect predominantly behavioral acculturation.

Dispositions. The National Council for Accreditation of Teacher Education (2001) stated that dispositions are the values, commitments, and professional ethics that influence behaviors toward students, families, colleagues, and communities and affect student learning, motivation, and development, as well as the educator's own professional growth. Dispositions are
guided by beliefs and attitudes related to values such as caring, fairness, honesty, responsibility,
and social justice. Examples of dispositions include fairness, being democratic, empathy,
enthusiasm, thoughtfulness, and respectfulness. Disposition measures are also created for
metacognition, self-regulation, self-efficacy, approaches to learning, and critical thinking.

Activity

Use the internet and give examples of affective scales under each of the following areas.

Attitudes Beliefs Interest Values Dispositions



Lesson 2
Steps in Constructing Non-Cognitive Measures

The steps involved in constructing an affective measure follow an organized sequence of procedures that, when properly carried out, results in a good scale. Constructing a scale is a research process in which the test developer poses research questions, formulates hypotheses about them, and gathers data to provide evidence of the scale's reliability and validity.

Steps in Constructing a Scale

Decide what information should be sought

The construction of a scale begins with clearly identifying what construct needs to be measured. A scale is constructed when (1) no scale is available to measure the construct, (2) available scales are foreign and unsuitable for the stakeholders or sample who will take the measure, (3) existing measures are not appropriate for the purpose of the assessment, or (4) the test developer intends to explore the underlying factors of a construct and eventually confirm them. Once the purpose of developing the scale is clear, the test developer decides what type of questionnaire to use: whether the measure will assess an attitude, belief, interest, value, or disposition.
When the specific construct is clearly framed, it is very important that the test developer search the relevant literature for studies involving the construct to be measured. What is needed from the literature review is the definition that the test developer wants to adopt and whether the construct has underlying factors. The definition and its underlying factors are the major basis for the test developer to write the items later on. A thorough literature review helps the test developer provide a conceptual framework as the basis for the construct being measured. The framework can come in the form of theories, principles, models, or a taxonomy that the test developer can use as a basis for hypothesizing the factors of the construct to be measured. Thorough knowledge of the literature about a construct also helps the researcher identify different perspectives on how the factors were arrived at and possible problems with the application of these factors across different groups. This will help the test developer justify the purpose of constructing the scale.
When the construct and its underlying factors or subscales are established through a thorough literature review, a plan for making the scale needs to be designed. The plan starts with creating a Table of Specifications. The Table of Specifications indicates the number of items for each subscale, the items phrased as positive and negative statements, and the response format.
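Purely as an illustration (the subscale names are borrowed from the BET example later in this lesson, and the item counts and format are hypothetical), a Table of Specifications can be kept in a simple structure such as the Python sketch below.

# Hypothetical Table of Specifications for a four-subscale disposition measure.
# Each entry: (positive items, negative items, response format)
table_of_specifications = {
    "Assertiveness":             (6, 4, "4-point Likert"),
    "Intellectual independence": (6, 4, "4-point Likert"),
    "Practical inclination":     (6, 4, "4-point Likert"),
    "Analytical interest":       (6, 4, "4-point Likert"),
}
total_items = sum(pos + neg for pos, neg, _ in table_of_specifications.values())
print(total_items)  # 40 items planned for the first draft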

Write the first draft of items

The test developer uses the definitions provided in the framework to write the preliminary items of the scale. Items are created for each subscale as guided by the conceptual definition. The number of items planned in the Table of Specifications is also considered. As much as possible, a large number of items is written to represent well the behavior being measured. To help the test developer write items, a well-represented set of behaviors manifesting the construct should be covered. Qualitative studies reporting specific responses are very helpful in writing the items. An open-ended survey, focus group discussions, and interviews can be conducted in order to come up with statements that can be used to write items. When these methods are employed at the start of item writing, the questions generally seek specific behavioral manifestations of the subscales to be measured. An example is the study of Magno and Mamauag (2008), who created the "Best Engineering Traits" (BET) scale, which measures dispositions of engineering students in the areas of assertiveness, intellectual independence, practical inclination, and analytical interest. The items in this scale were based on an open-ended survey conducted among engineering students. The survey asked the following questions:

1. How do you show your expertise in different situations as an Engineering student?

2. How do you apply engineering theories in your everyday life?

3. What are the instances that an Engineer needs to be assertive?

4. In what ways can an Engineer be independent in his intellectual thinking?

5. What do you think are other personality traits or characteristics that would make you an
effective engineer?

Examples of item statements generated from the survey responses are as follows:

1. I like watching repairmen when they are fixing something.


2. I gather necessary information before making decisions.
3. I hate to buy things in hardware stores.
4. I do not rely on mathematical solutions in arriving at conclusions.

Notice that the item statements begin with the pronoun "I." This indicates self-referencing for the respondents when they answer the items. Items 1 and 2 in the example are stated positively, while items 3 and 4 are stated negatively. This helps ensure that respondents answer consistently within a subscale, where the items should be responded to in the same way. For negative items, the responses are reverse-scored so that they become consistent with the positive items (a short reverse-scoring sketch is given at the end of this subsection). The following are guidelines in writing good items:

Good questionnaire items should:


1. Include vocabulary that is simple, direct, and familiar to all respondents
2. Be clear and specific
3. Not involve leading, loaded, or double-barreled questions
4. Be as short as possible
5. Include all conditional information prior to the key ideas
6. Be edited for readability

Examples of bad items:

I am satisfied with my wages and hours at the place where I work. (Double Barreled)

I am not in favor of congress passing a law not allowing any employer to force any employee to retire at any age. (Double Negative)

Most people favor the death penalty. What do you think? (Leading Question)
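As mentioned earlier in this subsection, negatively worded items are reverse-scored before analysis. The sketch below is a minimal illustration only (Python; the 4-point scale, the item numbers, and the sample responses are hypothetical).

# Reverse-score negatively worded items on a 1-4 agreement scale.
# A response r on a negative item becomes (scale_min + scale_max) - r.
SCALE_MIN, SCALE_MAX = 1, 4
NEGATIVE_ITEMS = {3, 4}          # e.g., items 3 and 4 in the sample items above

def score_item(item_number, response):
    if item_number in NEGATIVE_ITEMS:
        return (SCALE_MIN + SCALE_MAX) - response
    return response

responses = {1: 4, 2: 3, 3: 1, 4: 2}    # one respondent's raw answers
scored = {item: score_item(item, r) for item, r in responses.items()}
print(scored)  # {1: 4, 2: 3, 3: 4, 4: 3}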

Select a scaling technique

After writing the items, the test developer decides on the appropriate response format to
be used in the scale. The most common response formats used in scales are the Likert scale
(measure of position in an opinion), Verbal frequency scale (measure of a habit), Ordinal scale
(ordering of responses), and the Linear numeric scale (judging a single dimension in an array). A
detailed description of each scaling technique is presented in the next lesson.

Develop directions for responding

It is important that directions or instructions for the target respondents be created as early as when the items are written. Instructions should be clear and concise, and respondents should be informed how to answer. If you intend to use a separate answer sheet, make sure to tell the respondents about it in the instructions. Instructions should also cover how to record answers (encircle, check, or shade) and how to change an answer. In the instructions, tell the respondents specifically what they need to do.

This is an inventory to find out your suitability to further study Engineering. This can help guide you in
your pursuit of an academic life. The inventory attempts to assess what interests and strategies you have
learned or acquired over the years as a result of your study.

In the inventory, you will find statements describing various interests and strategies one acquires through
years of schooling and other learning experiences. Indicate the extent of your agreement or disagreement
to each of these statements by using the following scale:

4 STRONGLY AGREE (SA)


3 AGREE (A)
2 DISAGREE (D)
1 STRONGLY DISAGREE (SD)

There are no right or wrong answers here. You either AGREE or DISAGREE with the statement. It is
best if you do not think about each item too long --- just answer this test as quickly as you can, BUT
please DO NOT OMIT answering any item.

DO NOT WRITE OR MAKE ANY MARKS ON THE TEST BOOKLET. All answers are to be written on
your answer sheet.

Ensure that you have filled out your answer sheet properly and legibly for your name, school, date of
birth, age, and gender.

Be sure also that you have copied your test booklet number correctly on the space provided in your answer sheet. Do not turn the page until you are told to do so.

You have a total of 40 minutes to finish this whole test. Do not spend a lot of time in any one item.
Answer all items as truthfully and honestly as you can.

Notice that the instructions start with the purpose of the test. This is done to dispel any misconceptions that respondents may have about the test. The instructions then describe the kind of items to expect, and the respondent is told how to answer them. The scaling technique is also provided. The respondents are reminded that there are no right or wrong answers to discourage faking good or bad on the test. They are also reminded not to make any marks on the test booklet, to use the answer sheet, to answer all items, and to observe the time allotment. As much as possible, detailed instructions are provided to avoid any problems.

Conduct a judgmental review of items

For achievement tests and teacher-made tests, this procedure is called content validation. For affective measures, however, it would be difficult to conduct content validation because there is no defined content area for an affective variable; the definition and the behavioral manifestations from empirical reports serve as the areas measured. Instead, the items are reviewed against the definition or framework provided: whether they are relevant, whether they fall outside the confines of the theory or measure something else, whether they are applicable to the target respondents, and whether they need revision for clarity.
Item review is conducted among experts in the content being measured. In the process of item review, the conceptual definitions of the constructs are provided together with the constructed items to guide the reviewer and ensure that the items are properly framed. It is also necessary to arrange the items according to the subscale where each belongs so that the reviewer can easily evaluate the appropriateness of the items in that subscale. A suggested format for item review is shown below:

Practical Inclination – finding meaning about concepts and adapting to, shaping, and selecting environments covering a wide range of applications (Sternberg, 2004). Application, putting into practice, using knowledge, implementing, proposing something new.

(For each item, the reviewer marks Accept, Reject, or Revise, and writes a Suggested Revision.)

1. I like fixing broken things in the house.
2. I help out my father fix broken things in the house.
3. I help do the manual computation if there is no available calculator.
4. I help my friends organize their schedule if they do not know what to consider.

When giving items for review, the test developer writes a formal letter to the reviewer indicating specifically how the review should be done. Indicate as well whether you intend the reviewer to check the grammar of the statements, because most reviewers will focus only on the content and how it is framed by the definition.

Reexamine and revise the questionnaire

After the items have been reviewed, expect that there will be several corrections and comments. Several comments indicate that the items will be better because they have been thoroughly studied and critiqued. In fact, many comments should be more appreciated than few, because it means that the reviewers are offering better ways to fix and reconstruct your items. At this stage, it is necessary to consider the suggestions and comments provided by the reviewers. If there are things that are not clear to you, do not hesitate to go back and ask the reviewer once more. This will ensure that the items are better when the final form of the scale is assembled.

Prepare a draft and gather preliminary pilot data

Preparing the items for pilot testing requires laying out the test for the respondents. The general format of the scale should emphasize making it as easy as possible to use. Each item can be identified with a number or a letter to facilitate scoring of responses later. The items should be structured for readability and for recording responses. Whenever possible, items with the same response format are placed together. In designing self-administered scales, it is suggested to make them visually appealing to increase the response rate. The items should be self-explanatory, and the respondents should be able to complete them in a short time. In ordering the items, remember that the first few questions set the tone for the rest of the items and determine how willingly and conscientiously respondents will work on subsequent questions.
Before the actual pilot test, the items can first be administered to at least three respondents who belong to the target sample, observing which parts take them long to answer and whether the instructions are clearly followed. A retrospective verbal report can also be gathered after the participants answer the scale to clarify any difficulties that arose in answering the items.
In the actual pilot testing, the scale is administered to a large sample (e.g., N = 320). The ideal sample size is about three times the total number of items: if there are 100 items in the scale, the ideal sample size would be 300 or more. Having a large number of respondents makes the responses more representative of the characteristic being measured, and a large sample tends to make the distribution of scores approach normality.
In administering the scale, proper testing conditions should be maintained, such as the absence of distractions, appropriate room temperature, and proper lighting, to avoid conditions that can cause large measurement errors.

Analyze Pilot data

The responses to the scale should be recorded in a spreadsheet. The numerical responses are then analyzed. The analysis consists of determining whether the test is reliable and valid. Techniques for establishing validity and reliability are explained in chapter 3. If the test developer intends to use parallel forms or test-retest, then two testing occasions would be set in the design of the data gathering.
The analysis of items indicates whether the test as a whole and the individual items are valid and reliable. If principal components analysis is conducted, each item will have a corresponding factor loading, and items that do not load highly on any factor are removed from the item pool. Items whose removal would increase the Cronbach's alpha reliability of the test are likewise candidates for deletion. These techniques suggest removing certain items to improve the indices of reliability and validity of the test, which implies that a new form is produced based on the results of the item analysis. This is why a large pool of items is needed to begin with, because not all items will be accepted in the final form of the test.
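As a rough sketch of this kind of item analysis (assuming the pilot responses are already arranged as a respondents-by-items table; the data and code below are hypothetical illustrations, not a prescribed procedure), Cronbach's alpha and an "alpha if item deleted" check can be computed as follows.

import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array, rows = respondents, columns = items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def alpha_if_deleted(scores):
    """Recompute alpha with each item removed, to flag items worth dropping."""
    scores = np.asarray(scores, dtype=float)
    return {item: round(cronbach_alpha(np.delete(scores, item, axis=1)), 2)
            for item in range(scores.shape[1])}

# Hypothetical pilot data: 5 respondents x 4 items on a 4-point scale.
pilot = [[4, 3, 4, 1],
         [3, 3, 3, 2],
         [2, 2, 2, 4],
         [4, 4, 4, 1],
         [3, 3, 3, 3]]
print(round(cronbach_alpha(pilot), 2))
print(alpha_if_deleted(pilot))  # a notably higher value for an item flags it for removal

Factor loadings from a principal components analysis are inspected in the same spirit: items that do not load highly on any component are set aside before the scale is revised.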

Revise the Instrument

The instrument is then revised: items with low factor loadings are removed, and items whose removal increases Cronbach's alpha are also considered for deletion. In the process of principal components analysis, even though the test developer has proposed a set of factors, these factors may not hold because the items may group differently. The test developer then thinks of new factor labels for the new grouping of items. These cases require the test developer to revise the items and come up with another revised form. This revised form is again administered to another large sample to collect evidence that the scale is valid and reliable.

Gather final pilot data

For the final pilot data gathering, a large sample is again selected, about three times the number of items. The sample should have the same characteristics as the first pilot sample. The data gathered serve to establish the final estimates of the test's validity and reliability.

Conduct Additional Validity and Reliability Analysis

The validity and reliability are again analyzed using the new pilot data. The test developer wants to determine whether the same factors will still be formed and whether the test will still show the same index of reliability.

Edit the questionnaire and specify the procedures for its use

Items with low factor loadings are again removed, resulting in fewer items. A new form of the test with the reduced set of items is assembled; the remaining items have evidence of good factor loadings. The final form of the test can now be prepared.

Prepare the Test Manual

The test manual indicates the purpose of the test, the instructions for administering it, the procedure for scoring, and guidelines for interpreting the scores, including the norms. Establishing norms will be fully discussed in the next chapter.

Activity

Think of a construct that you want to study for a research or for your thesis in the future. Follow
the steps in test construction in developing the scale.

Lesson 3
Response Formats

This lesson presents the different scaling techniques used in tests, questionnaires, and inventories. The important assumption behind putting scales on tests and questionnaires is that they provide quantities that can be analyzed and interpreted statistically. One characteristic of research is that its variables should be measurable; through scales we are able to measure and quantify the concepts under study. Scales also enable the results to be analyzed with mathematical formulas to arrive at quantitative results.
The scaling techniques discussed here can be categorized according to the levels of measurement: nominal, ordinal, interval, and ratio. In some references, the scaling techniques are presented in conjunction with the levels of measurement. The levels of measurement are mentioned here to keep them distinct as a topic and to show how they relate to scaling techniques.
According to Bailey (1996), scaling is a process of assigning numbers or symbols to various levels of a particular concept that we wish to measure. Scales can be used with either open-ended or close-ended questions. For open-ended questions, scales refer to the criteria set in order to assess the information presented effectively and objectively. For close-ended questions, scales refer to response formats for certain concepts and statements. Varieties of these scales serving as response formats on tests and questionnaires are presented in this lesson.
Before presenting the varieties of scaling techniques, keep the following questions in mind as a framework for the discussion:

(1) What kind of question is this scale used for?
(2) What general behavior does this scale measure?
(3) What is the unique feature of this scaling technique?
(4) What are the advantages and disadvantages in using this scale?

Classification and Types of Scales

The scaling techniques can be classified into categories based on the type of question for which they are used: scaling techniques for multiple choice questions, conventional scale types for measuring behavior on questionnaires, scale combinations, nonverbal scales for questions requiring illustrations, and social scaling for obtaining the profile of a group (Alreck & Settle, 1995).

MULTIPLE CHOICE QUESTIONS

Multiple choice questions are common and known for being simple and versatile. They can be used to obtain measures of mental ability and a variety of behavioral patterns, and they are ideal for responses that fall into discrete categories. When the answers can be expressed as numbers, a direct question should be used and the number of units recorded.

1. Multiple Response Item


In this scaling technique the respondents can indicate one or more alternatives, and they are instructed to check any that apply within the question itself. In this case each alternative becomes a variable to be analyzed.

Please check any type of food that you regularly eat in the cafeteria.

___ Hamburger
___ Pasta
___ Soup
___ Fried chicken
___ French fries

2. Single-Response Item
In this scaling technique one alternative is singled out from among several by the
respondent. The item is still multiple choice but only one response is required. Single response
items can be used only when (1) the choice criterion is clearly stated and (2) the criterion
actually defines a single category.

What kind of food do you most often eat in the cafeteria? (Check only one)

___ Hamburger
___ Pasta
___ Soup
___ Fried chicken
___ French fries

CONVENTIONAL SCALE TYPES

These types of scales are commonly used for surveys. Every information need or survey question can be scaled effectively with one or more of these scales. One should remember that deciding on a scaling technique is a matter of choosing among the conventional scales.

3. Likert Scale
The Likert scale is used to obtain people's positions on certain issues or conclusions. This is a form of opinion or attitude measurement. In this scale, the issue or opinion is gauged from the respondents' degree of agreement or disagreement.
The advantages of this scale include flexibility, economy, and ease of composition. The procedure is flexible because items can be only a few words long, or they can consist of several lines. The method is economical because one set of instructions and one scale can serve many items. The respondent can quickly and easily complete the items.
The Likert scale also enables the researcher to obtain a summated value. Besides obtaining the results of each item, a total score can be obtained from a set of items. The total value serves as an index of attitude toward the major issue as a whole.

Please pick a number from the scale to show how much you agree or disagree with each statement and
jot it in the space to the left of the item.

Scale
1 Strongly agree
2 Agree
3 Neutral
4 Disagree
5 Strongly disagree

____ 1. I can get a good job even if my grades are bad.


____ 2. School is one of the most important things in my life.
____ 3. Many of the things we learn in class are useful.
____ 4. Most of what I learn in school will be useful when I get a job.
____ 5. School is not a waste of time.
____ 6. Dropping out of school would be a huge mistake for me.
____ 7. School is more important than most people think.
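A minimal sketch of obtaining a summated value from a set of Likert items follows (Python; the responses are hypothetical). Note that in the scale above 1 means Strongly agree and 5 means Strongly disagree, so a lower total reflects stronger overall agreement with the statements; negatively worded items, if any, would first be reverse-scored as illustrated in the previous lesson.

# Summated Likert score for the seven items above (hypothetical responses).
responses = [2, 1, 2, 3, 1, 2, 2]   # answers to items 1-7
summated = sum(responses)
print(summated)  # 13, within the possible range of 7 (all 1s) to 35 (all 5s)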

4. Verbal Frequency Scale


The verbal frequency scale contains five words that indicate how often an action has been taken. This scale is used to determine the frequency of some action or behavior of respondents. A straightforward question is recommended when the absolute number of times is appropriate and required. In using the verbal frequency scale, the researcher wants to know the proportion or percentage of activity, given an opportunity to perform it.
The advantage of using this scale is that respondents are not forced to recall precisely how many times they have behaved in a certain way. Another is the ease of assessment and response by those being surveyed. It also allows activity levels to be arrayed across a five-category spectrum for data description, and it makes it easy to compare subsamples or different actions for the same sample of respondents.
A disadvantage is that it provides only a gross measure of proportion.

Please pick a number from the scale to show how often you do each of the things listed below and jot in
the space at the left.

Scale
1 Always
2 Often
3 Sometimes
4 Rarely
5 Never

___ 1. I take a brunch between breakfast and lunch.
___ 2. I take a light snack 4 hours after lunch.
___ 3. I take a midnight snack.

5. Ordinal Scale
The ordinal scale is also a multiple choice item, but the response alternatives do not stand at fixed intervals from one another; rather, they define an ordered sequence. The responses are ordinal because each category listed comes before the next one.
The principal advantage of the ordinal scale is the ability to obtain a measure relative to some other benchmark: the order is the major focus, and not simply the chronology.

Ordinarily, when would you or someone in your family read a pocket book at home on a weekday? (Please check only one)

___ The first thing in the morning


___ A little while after awakening
___ Mid-morning
___ Just before lunch
___ Right after lunch
___ Mid-afternoon
___ Early evening before dinner
___ Right after dinner
___ Late evening
___ Usually don’t read pocket books

6. Forced Ranking Scale


The forced ranking scale produces ordinal values, and the items are each ranked relative to one another. This scaling technique obtains not only the most preferred item but also the sequence of the remaining items.
One of the main advantages of this scaling technique is that the relativity or relationship measured is among the items themselves. The forced ranking scale indicates what those choices are likely to be, out of an unlimited number of alternatives.
Its limitation is its failure to measure the absolute standing of the items and the interval between them. The number of entities or items that can be ranked is also a limitation, since respondents must first go through the entire list and identify their first choice.

Please rank the books listed below in their order of your preference. Jot the number 1 next to the one you
prefer most, number 2 by your second choice, and so forth.

___ Harry Potter series


___ Lord of the Rings Series
___ Twilight
___ The Lion, the Witch, and the Wardrobe

7. Paired Comparison Scale


This scale is used to measure simple, dichotomous choices between alternatives. The
focus must be almost exclusively on the evaluation of one entity relative to another. This scaling
is accomplished where only two items are ranked at a time.
One major problem is the lack of transitivity when there are several pairs to be ranked, and if the data are summated there can be cases of "ties." These limitations are avoided by using ratings, rather than rankings, of items taken two at a time.

For each pair of study skills listed below, please put a check mark by the one you most prefer, if you had
to choose between the two.

___ Note taking


___ Memorizing

___ Memorizing
___ Graphic organizer

___ Note taking


___ Graphic organizer

8. Comparative Scale
The comparative scale is appropriate when making comparison(s) between one object
and one or more others. With this type of scale, one entity can be used as the standard or
benchmark by which several others can be judged.
The advantage of this scale is that no absolute standard is presented or required and all
evaluations are made on a comparative basis. Ratings are all relative to the standard or
benchmark used. When no absolute standard exists, the comparative scale approach is applicable. Another advantage is its flexibility: the same two entities can be compared on several dimensions or criteria, and several different entities can be compared with the standard.
The comparative scale is used when the research interest is in comparing one's own sponsor's store, brand, institution, organization, candidate, or individual with competing ones.
According to Alreck and Settle (1995), comparative scales are more powerful in several ways: they present an easy, simple task to the respondent, ensuring cooperation and accuracy; they provide interval data, rather than only the ordinal values that rankings do; they permit several things that have been compared to the same standard to be compared with one another; and economy of space and time is inherent in them.

Compared to the previous teacher, the new one is… (Check one space)

Very Superior        About the same        Very Inferior
     1          2          3          4          5

9. Linear, Numeric Scale


The linear, numeric scale is used in judging a single dimension and arrayed on a scale
with equal intervals. The scale is characterized by a simple, linear, numeric scale with extremes
labeled appropriately.
This scaling technique is economical, since a single question, set of instructions, and
rating scale apply to many individual items. It also provides absolute measures of importance
and relative measures, or rankings, if responses among the various items are compared.
The linear, numeric scale is less appropriate for measuring approximate frequency, and
not applicable when direct comparison with a particular standard is required.

How important to you is each of the people in the school listed below?

If you feel that a person in the school is extremely important, pick a number from the far right side of the scale and jot it in the space beside the item. If you feel the person is extremely unimportant, pick a number from the far left; and if you feel the importance is somewhere between these extremes, pick a number from the middle of the scale to show your opinion.

Scale

Extremely Unimportant 1 2 3 4 5 Extremely Important

___ 1. Directress
___ 2. Principal
___ 3. Teachers
___ 4. Academic Coordinator
___ 5. Discipline officer
___ 6. Cashier
___ 7. Registrar
___ 8. Librarian
___ 9. Janitor

10. Semantic Differential Scale


In using this scaling device, the image of a brand, store, political candidate, company, organization, institution, or idea can be measured, assessed, and compared with that of a similar topic. The areas investigated are called entities.
In using this scale, the researcher must first select a series of adjectives that might be used to describe the topic or object. The attributes used by the researcher should be relevant in the minds of the respondents. Once the adjectives have been identified, a polar opposite of each adjective must be determined.
The advantage of this scale is its ability to portray images clearly and effectively. The results provide a profile of the image of the topic being rated because several pairs of bipolar adjectives are used. Also, entire image profiles can be compared with one another. Another advantage is its ability to measure ideal images or attribute levels. The disadvantage lies in the difficulty of arriving at antonyms of the concepts for each item.

Please put a check mark in the space on the line below to show your opinion about the school guidance
counselor

Empathic ______ ______ ______ ______ _____ _____ _____ Apathetic


1 2 3 4 5 6 7
Approachable ______ ______ ______ ______ _____ _____ _____ Aloof
1 2 3 4 5 6 7
Understanding ______ ______ ______ ______ _____ _____ _____ Defensive
1 2 3 4 5 6 7
Unconditional ______ ______ ______ ______ _____ _____ _____ Conditional
1 2 3 4 5 6 7

11. Adjective Checklist


This scale presents descriptive adjectives or phrases, and respondents mark those that apply to the topic or object of study. Compared with the semantic differential scale, the adjective checklist is a very straightforward method of obtaining information about how a topic is described and viewed.
The advantage of the adjective checklist is its simplicity, directness, and economy. The adjectives listed can be varied, and short descriptive phrases can even be used. This is useful in exploratory research work.
The disadvantage of this scale is the dichotomous data it yields: there is no indication of how much each item describes the topic.

Please put a check mark on the space in front of any word that describes your school.

___ Easy ___ Safe


___ Technical ___ Exhausting
___ Boring ___ Difficult
___ Interesting ___ Rewarding

12. Semantic Distance Scale


The semantic distance scale includes a linear, numeric scale below the instructions and above the descriptive adjectives or phrases. It requires the respondents to rate how well each item describes the topic. The data generated by the scale are interval distances from the item to the topic. This scale is also used to portray an image.
The advantage of this scale is that the adjectives or images can be specified without comparing them to their opposites, and the interval data it produces can be manipulated and statistically processed. The disadvantage is its greater complexity: the respondents' task is more difficult to explain.

Please pick a number from the scale to show how well each word or phrase below describes your teacher
and jot it in the space in front of each item.

Scale
Not at all 1 2 3 4 5 6 7 Perfectly

___ Intelligent ___ Approachable


___ Strict ___ Good in teaching
___ Respected ___ Can control the class

13. Fixed Sum Scale


The fixed sum scale is used to determine what proportion of some resource or activity has
been devoted to each of several possible choices or alternatives. The scale is most effective when
it’s used to measure actual behavior or action in the recent past. Ordinarily, about 10 different
categories are the maximum, but as few as 2 or 3 can be used. The number to which the data
must total has to be very clearly stated.
The major advantage of this scale is its simplicity and clarity. The instructions are easily understood and the respondent's task is ordinarily easy to complete. It is also important to add an inclusive alternative for "others."

Of the last 10 times you went to the library, how many times did you visit each of the following library sections?

___ Reference
___ Periodical
___ Circulation
___ Filipiniana
___ Other (What? __________________)

SCALE COMBINATION

Scale combinations take the form of items listed together in the same format, sharing a common scale. This saves valuable questionnaire space, reduces the response task, and facilitates recording. The respondents mentally carry the same frame of reference and judgment criteria from one item to the next, so the data are closely comparable.

14. Multiple Rating List


It is a commonly used variation of the linear, numeric scale. The difference is that the
multiple rating list has the labels of the scale extremes at the top, and the scale itself is listed beside
each item.
The advantage is that all the respondent has to do is circle a number, which is easier
than writing one, and the responses form a visual pattern. The juxtaposition of the responses on a
horizontal spectrum is a closer mapping to the way people actually think about the evaluations
they are making.

Several colleges and universities are listed below. Please indicate how safe or risky their location is by circling the
number beside it. If you feel it is very safe, circle a number towards the left. If you feel it is very risky, circle one
towards the right, and if you think it is some place in between, circle a number from the middle range that indicates
your opinion.

Extremely Safe Extremely Risky


University of the Philippines 1 2 3 4 5 6 7
De La Salle University-Manila 1 2 3 4 5 6 7
Ateneo de Manila University 1 2 3 4 5 6 7
Mapua Technical Institute 1 2 3 4 5 6 7
University of Sto Tomas 1 2 3 4 5 6 7

15. Multiple Rating Matrix


It is a condensed format using a combination of linear, numeric scale items. The
difference lies in the way the items are listed in a matrix of rows with multiple columns.
This scaling technique has two advantages. First, it saves questionnaire space: the
multiple rating matrix takes less space, yet it captures many data points. Second, because the
objects and the characteristics being rated are all very close to one another, the respondents
are readily able to compare their evaluations from one rating object to another.
The disadvantage is its complexity: the instructions are complex and
the task is a bit difficult.

The table below lists 3 universities, and several characteristics of universities along the left side. Please
take one university at a time. Working down the column, pick a number from the scale indicating your
evaluation of each characteristic and jot it on the space in the column below the university and to the right
of the characteristic. Please fill in every space, giving your rating for each university on each
characteristic.

Scale
Very Poor 1 2 3 4 5 6 Excellent

                  University of the    De La Salle           Ateneo de Manila
                  Philippines          University-Manila     University
Faculty           _____                _____                 _____
Research          _____                _____                 _____
Facilities        _____                _____                 _____
Services          _____                _____                 _____

16. Diagram Scale


The diagram scale is useful for measuring configurations of several things, where the
spatial relationships convey part of the meaning.

Please list the ages of all those in your class in the spaces below. Jot the ages of the boys in the top
circles and the ages of the girls in the bottom circles.

Boys
♂ ♂ ♂ ♂ ♂ ♂
Girls
♀ ♀ ♀ ♀ ♀ ♀
NONVERBAL SCALES

The Nonverbal scales take the form of pictures and graphs to obtain the data. This is
useful for respondents who have limited ability to read or to understand numeric scales.

17. Picture Scale


It accommodates respondents who may not recognize letters, numbers, and other symbols by using
familiar facial expressions and other illustrations. Some points to consider in creating a picture
scale are: (1) the pictures must be very easy for respondents to understand; (2) they should show
something respondents have often seen; (3) they should represent the thing that is being
measured; and (4) they should be easy to draw or create.

18. Graphic Scale


The graphic scale shows in ascending or descending order the amount of the information that
is being quantified. The graphic scale provides more useful measurement data because the
extremes visually represent none and all (or total). Picture and graphic scales are most often used
only for personal interview surveys because they are designed for a special need.

Which of the faces indicates your feeling about your math course?
[Picture scale of faces omitted]

How much have you learned in your math course?
[Graphic scale omitted; response options 5 4 3 2 1]

What is the level of your math proficiency?
[Graphic scale omitted]


SOCIAL SCALING

Social scaling is defined by Lazarsfeld (1958) as “properties of collectives which are
obtained by performing some operation on data about the relations of each member to some or
all of the others.”

19. Sociometric Scaling


Sociometric measures are generally constructed by administering to all members of the
group a questionnaire asking each about his or her relations with the other members of the group
(Bailey, 1995).
One way of analyzing sociometric data is in the form of the sociometric matrix. A
sociometric matrix lists the persons’ names in both the rows and columns, and uses some code to
indicate which person is chosen by the subject in response to the question.

20. Sociogram
The sociogram is a graphic representation of sociometric data. In a sociogram, each
individual is represented by an illustrative symbol, and the symbols are connected by arrows
that describe the relationships among the individuals involved. Those chosen most often are
referred to as stars, those not chosen by others are called isolates, and the small groups made up
of individuals who choose one another are called cliques (Best & Kahn, 1990).
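
To make the sociometric matrix concrete, here is a minimal sketch in Python using hypothetical members and choices. It builds the matrix from each member’s choices and tallies how often each person is chosen, which is one simple way of spotting potential stars and isolates.

```python
# Minimal sketch (hypothetical data): building a sociometric matrix from choices.
members = ["Ana", "Ben", "Cara", "Dan"]
# Each member lists the classmates he or she chooses (e.g., "Whom do you want to work with?").
choices = {
    "Ana": ["Ben", "Cara"],
    "Ben": ["Ana"],
    "Cara": ["Ana", "Ben"],
    "Dan": ["Ana"],
}

# Matrix cell [chooser][chosen] = 1 if the chooser selected that person, else 0.
matrix = {c: {m: (1 if m in choices[c] else 0) for m in members} for c in members}

# Column totals: how many times each member was chosen by the others.
times_chosen = {m: sum(matrix[c][m] for c in members) for m in members}
print(times_chosen)  # {'Ana': 3, 'Ben': 2, 'Cara': 1, 'Dan': 0}

stars = [m for m, n in times_chosen.items() if n == max(times_chosen.values())]
isolates = [m for m, n in times_chosen.items() if n == 0]
print("Stars:", stars, "Isolates:", isolates)
```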

Scale Selection Criteria

Some scales are easily identified as potentially useful for obtaining the information needed
for a given purpose or question, while other scales are clearly inappropriate.

How to create effective scales?

1. Keep it simple. The least complex scale that will do the job should be used. Even after identifying a scale,
consider whether an easier, simpler scale would work.
2. Respect the respondent. Select scales that make the task as quick and easy as possible for the
respondents; this reduces non-response bias and improves accuracy.
3. Dimension the response. The dimensions respondents have in mind are not usually common to
one another, so some commonality must be discovered. The dimensions must not be obscure or difficult, and
they should parallel respondents’ thinking.
4. Pick the denominations. Always use the denominations that are best for respondents. The
data can later be converted to the denominations sought by information users.
5. Choose the range. Categories or scale increments should be about the same breadth as those
ordinarily used by respondents.
6. Group only when required. Never put things into categories when they can easily be
expressed in numeric terms.
7. Handle neutrality carefully. If respondents genuinely have no preference, they will resent the
forced choice inherent in a scale with an even number of alternatives. If feelings are not
especially strong, an odd number of scale points may result in fence-riding or piling up on the
midpoint, even when some preference exists.

8. State instructions clearly. Even the least capable respondents must be able to understand. Use
language that’s typical of the respondents. Explain exactly what the respondent should do
and the task sequence they should follow. List the criteria by which they should judge and
use an example or practice if there is any doubt.
9. Always be flexible. The scaling techniques can be modified to fit the task and the
respondents.
10. Pilot test the scales. Individual scales can be checked with a few typical respondents.

References

Anastasi, A. (1990). Psychological testing. New York: Macmillan.

Anderson, L. W. (1981). Assessing affective characteristics in the schools. Boston: Allyn and
Bacon.

Alreck, P. L. & Settle, R. B. (1995). The survey research handbook (2nd ed.). Chicago: Irwin
Professional Books.

Bailey, K. D. (1995). Methods of social research (4th ed.). New York: Macmillan.

Bandura, A. (1977). Self-efficacy: Toward a unifying theory of behavioral change. Psychological
Review, 84, 191-215.

Best, J. W. & Kahn, J. V. (1995). Research in education (6th ed.). New Jersey: Prentice Hall.

Dilts, R. B. (1999). Sleight of mouth: The magic of conversational belief change. Capitola, CA:
Meta Publications.

Halstead, J. M. & Taylor, M. J. (2000). Learning and teaching about values: A review of recent
research. Cambridge Journal of Education, 30, 169-203.

Knafo, A. & Schwartz, S.H. (2003). Parenting and adolescents’ accuracy in perceiving parental
values. Child Development, 74(2), 595-611.

Lazarsfeld, P. F. (1958). Evidence and inference in social research. Daedalus, 8, 99-130.

National Council for Accreditation of Teacher Education. (2001). Professional standards for the
accreditation of schools, colleges, and departments of education. Washington, DC: Author.

Meece, J., Parsons, J., Kaczala, C., Goff, S., & Futterman, R. (1982). Sex differences in math
achievement: Toward a model of academic choice. Psychological Bulletin, 91, 324-348.

Overmier, J. B. & Lawry, J. A. (1979). Conditioning and the mediation of behavior. In G. H.
Bower (Ed.), The psychology of learning and motivation (pp. 1-55). New York: Academic Press.

Pajares, M. F. (1992). Teachers' beliefs and educational research: Cleaning up a messy
construct. Review of Educational Research, 62, 307-332.

Richardson, V. (1996). The role of attitude and beliefs in learning to teach. In J. Sikula, T.
Buttery, & E. Guyton (Eds.), Handbook of research on teacher education ( pp. 102-119). New
York: Macmillan.

Savickas, M. L. (1999). The psychology of interests. In M. L. Savickas & A. R. Spokane (Eds.),
Vocational interests: Meaning, measurement and counseling use (pp. 19-56). Palo Alto, CA:
Davies-Black.

Schommer, M. (1990). Effects of beliefs about the nature of knowledge on comprehension.
Journal of Educational Psychology, 82, 498-504.

Sta. Maria, M. & Magno, C. (2007). Dimensions of Filipino negative emotions. Paper presented
at the 7th Conference of the Asian Association of Social Psychology, July 25-28, 2007, Kota
Kinabalu, Sabah, Malaysia.

Strong, E. K. (1955). Vocational interests 18 years after college. Minneapolis: University of
Minnesota Press.

Taylor, E. (2003). Making meaning of non-formal education in state and local parks: A park
educator's perspective. In T. R. Ferro (Ed.), Proceedings of the 6th Pennsylvania Association of
Adult Education Research Conference (pp. 125-131). Harrisburg, PA, Temple University.

Taylor, E., & Caldarelli, M. (2004). Teaching beliefs of non-formal environmental educators: A
perspective from state and local parks in the United States. Environmental Education Research,
10, 451-469.

Zimbardo, P. G, & Leippe, M. R. (1991). The psychology of attitude change and social
influence. New York: McGraw Hill.

Chapter 6
Art of Questioning

Chapter Objectives:

1. Develop a deep understanding of the functional role of questioning in enhancing
students’ learning;
2. Critically assess the circumstances under which certain types of questions may be more
useful;
3. Frame questions that are appropriate for the target skills to be developed in the students.

Lessons:

1. Functions of Questioning
2. Types of Questions
3. Taxonomic Questioning

Lesson 1
Functions of Questioning

Every time we get inside our classrooms and deal with our students in various teaching
and learning circumstances, our ability to ask questions is always brought to the fore. Being
intricately embedded in our pedagogies and assessments, questioning is one of the most basic
processes we deal with. But to ask questions that are appropriate, we need to discuss the
art of questioning.
To begin with, we ask ourselves this fundamental question, “Why do we ask questions?”
From our teaching methods and strategies to our assessments, questioning is inevitable. From the
transmissive to more constructivist approaches of teaching, asking questions is always a
“mainstay” process. To answer this fundamental question, we need to first look into the function
of questioning as it works within ourselves, then in terms of how it works in the learning
process in general.
As you are reading this chapter, or even the previous chapter of this book, you
effortlessly ask questions. Why is that? How important is that process in our understanding of the
concepts we are trying to learn about? Whenever you ask a question, regardless of whether you just
keep it in mind or express it verbally, you activate your senses and drive your attention to what
you are currently processing. As you engage a reading material, for example, and you ask
questions about what you are reading, you are bringing yourself into a deeper level of the
learning experience where you become more “into the experience.” Obviously, questioning
brings you to the level of focused and engaged learning as you become particularly attentive to
everything that takes place within and around you.

In the classrooms, we ask our students many questions without always being aware that
the kinds of questions we ask make or break students’ deep academic engagement. At this
juncture, therefore, we emphasize the point that, as teachers, just asking questions is not enough
to bring our students to the level of engagement we desire. What matters in this case is the
quality of the questions we ask them. The effects of questioning on our students differ,
depending on how “good” or “bad” our questioning is.
From various studies, we now know that “good” questioning positively affects students’
learning. Teachers’ good questioning boosts students’ classroom engagement because the
atmosphere where good questions are tossed encourages them to push themselves some more
into the state of inquiry. If students feel that questions are interesting, sensible, and important,
they are driven not only to “know more” but also “think more.” Good questioning encourages
deep thinking and higher levels of cognitive processing, which can result in better learning of the
subject matter in focus. One distinct mark of a classroom that employs good questioning is that
students generally participate in a scholarly conversation. This happens because teachers’ good
questioning encourages the same good questioning from the students as they discuss with their
teachers and with each other.
On the contrary, bad questioning distorts the climate of inquiry and investigation. It
undermines the students’ motivation to “know more” and “think more” about the subject matter
in focus. If, for example, a teacher’s question makes the student feel stupid and incapable of
answering, the whole process of questioning leads to a breakdown of students’
academic engagement. Indeed, it is important for a teacher to always think of his or her
intentions for tossing questions in the class. Certainly, questions encouraged by a sound motive
will work better than ill-motivated ones.

Think about questioning as a tool for increasing students’ academic engagement.

Mentally explore what kinds of motives may encourage learning and what other kinds of
motives may undermine students’ learning.

Write your thoughts down or bring them up in class for purposes of academic discussion.

Lesson 2
Types of Questions

Now that you have explored the kinds of motives that may encourage or
undermine students’ learning, it is helpful if you focus on those motives that establish an
atmosphere of inquiry in your classrooms. Focus on those intentions that will allow for the use of
questioning as a tool for deep learning rather than those that embarrass students and discourage
them from engaging your lessons.
However, because teaching is not a trial-and-error endeavor, motives might not be
enough to guide our questioning so that it has desirable effects on our students’ learning. With
the sound motive being the undercurrent of our questioning, we need to also know what types of
questions to ask to engage our students.

Interpretive Question
This type of question calls for students’ interpretation of the subject matter in focus. It
usually asks students to provide missing information or ideas so that the whole concept is
understood. An interpretive question assumes that, as students engage the question, they monitor
their understanding of the consequences of the information or ideas. In a class with primary
graders, the teacher narrated a story about a boy in a dark-blue shirt who was lost in a crowd of people at
a carnival one evening, and his mother roved around for hours to find him. After narrating the
story, one of the questions the teacher asked her pupils was, “If the boy wore a bright-colored
shirt, what could change in the mother’s effort in looking for the boy?” Questions that call for
interpretation of a situation are a powerful tool for activating your students’ analytical ability.

Inference Question
If the question you ask intends that students go beyond available facts or information and
focus on identifying and examining the suggestive clues embedded in the complex network of
facts or information, you may toss up an inference question. After a series of discussions on
the Katipunan revolution in a Philippine history class, the teacher presented a picture that
appeared to capture a perspective of the Katipunan revolution. As the teacher showed the picture,
he asked, “What do you know by looking at this picture?” Having learned about Katipunan
revolution from its different angles, students were prompted to explore clues that may suggest
certain perspectives of the event, and focus on a more salient clue that represented one
perspective, such as, for instance, the common people’s struggles during the revolution, or the
bravery of those who fought for the country, or the heroism of its leaders. Inference questions
encourage students to engage in higher-order thinking and organize their knowledge rather than
just randomly fire out bits and pieces of information.

Transfer Question
Questioning is one of the processes that affect transfer (Mayer, 2002). Transfer questions
are tools for a specific type of inference where students are asked to take their knowledge to new
contexts, or bring what they already know in one domain to another domain. Questions of this
type bring students to a level of thinking that goes beyond just using their knowledge where it is
used by default. For example, after a lesson on the literary works of Edgar Allan Poe, students
were already familiar with Poe’s literary style or approach. So that the teacher can draw inferences about
students’ familiarity with and understanding of Poe’s rhetorical “trademark,” the teacher thinks of a
literary work from a different source, let us say, one from the long list of fairy tales. Then the
teacher asked a transfer question, “Imagine that Edgar Allan Poe wrote his version of the fairy
tale story, ‘Jack and the beanstalk,’ you are making a critical review of his version of the story,
what do you expect to see in his rhetoric quality?” This question prompts the students to bring
their knowledge of Poe’s rhetoric style to a new domain, that is, a different literary piece with a
different rhetorical quality. This question further encourages the students to thresh out only the
relevant knowledge that must be transferred and, therefore, helps them account for their learning
of the subject matter.

Predictive Question
Asking predictive questions allows students to think in the context of a hypothesis.
Through questions of this type, students infer what is likely to happen given the
circumstances at hand. In other words, students are compelled to think about the “what if” of the
phenomenon under study, mindful of the circumstances in focus. This type of question has long
been used in the natural sciences, but is certainly not for their exclusive use. In any subject area,
we can let our students think scientifically. One of the ways to do so is to let them engage our
predictive questions or to drive them to raise the same type of question in the class. Predictive
questions prompt the students to go beyond the default condition and infer what is likely to
happen if some circumstances change. Here, students make use of higher levels of cognitive
processing as they estimate probabilities.

Metacognitive Question
The types of questions discussed above all focus on students’ cognitive processes. To
bring students into the level of regulation over their own learning, we also need to ask
metacognitive questions. Questions of this type allow students to think about how they are
thinking, and learn about how they are learning your course lessons. Successful learners tend to
show higher level of awareness of how they are thinking and learning. They show clear
understanding of how they struggle with academic tasks, comprehend written texts, solve
problems, or make decisions. A metacognitive question invites students to know how they know,
and, thus, become more aware of the processes that take place within them while they are
thinking and learning. In a math class, for instance, the teacher not only asks the student to solve a word
problem but also asks him or her to describe how he or she was able to solve it.

A. Think of a subject matter within your area of specialization.


B. Make a rough plan as to how you might present the subject matter in an
appropriate class (considering grade/year level).
C. Formulate questions that will likely encourage your students to engage the
subject matter. Try as much as possible to formulate questions in all types of
questions discussed above.
D. Justify why those questions fall under their respective types.

Lesson 3
Taxonomic Questioning

After trying your best to formulate questions for every type of question discussed above,
we will now bring you to the discussion on planning the questioning in terms of taxonomic
structure. Questions differ not only in terms of types but also in terms of what cognitive
processes are involved based on the taxonomy of learning targets you are using. For our students
to benefit more from our questioning, it is necessary to plan our questioning taxonomically.
In Chapter 2 of this book we learned about the different taxonomic tools for setting your
learning intents or target. These tools also serve as frameworks for planning and constructing
your questions. Because questioning influences the quality of students’ reasoning, the questions
we ask our students to respond to must be pegged on certain levels of cognitive processes
(Chinn, O’Donnell, & Jinks, 2000). For example, Bloom’s taxonomy provides a way of
formulating questions in various levels of thinking, as in the following:
Questions intended for knowledge should encourage recall of information. Such
questions may be What is the capital city of…? or What facts does… tell?
For comprehension, questions should call for understanding of concepts, such as What is
the main idea of…? or Compare the…
Questions at the level of application must encourage the use of information or concept in
a new context, like How would you use…? or Apply… to solve…
If analysis is desired where students are driven to think critically, the questions must
focus on relationships of concepts and logic of arguments, such as What is the difference
between…?” or How are…and…analogous?
To encourage synthesis, questioning must focus on students’ original thinking and
emergent knowledge, like Based on the information, what could be a good name for…? or What
would…be like if…?

In terms of questioning at the level of evaluation, students are prompted to judge the
ideas or concepts based on certain criteria. Questions may be like Why would you choose…? or
What is the best strategy for…?

If you are to use the revised taxonomy, where you need to consider both the knowledge
and cognitive process dimensions, it is important that you first identify the knowledge dimension
you wish to focus on, and ask yourself, “What questions will be appropriate for every knowledge
dimension?”

A. Think about your understanding of the definitional meaning of each type of
knowledge (factual, conceptual, procedural, metacognitive).
B. Based on your understanding of their definitional meanings, discuss what kinds of
questions to ask for each of the types of knowledge.
C. Discuss your ideas with your teacher and/or classmates.
D. Synthesize your understanding after sharing your thoughts and listening to those
of others.

Your clear understanding of the kinds of questions to ask based on the types of
knowledge in focus helps you to categorically focus on any of those types of knowledge,
depending on what is relevant to your teaching and assessment at any given time. After
anchoring your questions into a particular type of knowledge, the next step is to frame your
question so that it conveys the relevant cognitive process needed for a successful learning of the
subject matter. If your focus is factual knowledge, you can toss up different questions that vary
according to the cognitive processes. You can raise a question on factual knowledge that
necessitates the use of recall (remember) or synthesis (create), depending on your learning
intents. You can navigate in the same way across the different levels of cognitive processing
while anchoring on any other type of knowledge.

A. Based on a subject matter of your choice, formulate at least one factual
knowledge question for each of the six levels of cognitive processes in the revised
Bloom’s taxonomy (remember, understand, apply, analyze, evaluate, & create).
B. Based on the same (or a different) subject matter, do the same for conceptual
knowledge, then for procedural knowledge, then for metacognitive knowledge.
C. Share your output in the class for discussion.

You may also try out the alternative taxonomic tools discussed in Chapter 2, and see
how you can brush up on your art of questioning while staying on track towards your
learning intents. When you wish to verify the validity of your questions, always go back to the
conceptual description of the taxonomy. This is an important process as you build on your
art of questioning so that, aside from its artistic sense, your questioning also becomes scientific
insofar as the teaching-and-learning process is concerned.

Lesson 4
Practical Considerations in Questioning

We now give you some tips in questioning. These tips are add-on elements to the items
that have already been discussed in the preceding section of this chapter.

Consider your students’ interest


Before you ask your students a question, think whether the question you are about to ask
can arouse their interest in the subject matter. Think of a context that might interest your students
and use that context as the backdrop of your question. Here, it is important that you know the
“language” of your students. You should be able to anticipate their needs based on their
developmental characteristics. Also, you should have an idea of their interests, such as the kinds
or genres of music they enjoy listening to, the computer games they play, or the kinds of sports they
engage in, and so on. If you contextualize your question in these aspects, you know that your
students are likely to engage your question.

Hold on to your targets


As you did the “On-Task” exercises in the previous section of this chapter, you realized
that the questions we toss up in our classrooms must be anchored on our learning intents.
Airasian (2000) contends that the questions we ask communicate to our students which topics and
processes are important. To be on track, classroom questioning should be aligned to relevant
instructional targets. Always remember to ask questions that do not only allow students to
orchestrate their cognitive prowess but also those that scaffold them to think at the level you so
desire. Make sure your questions are sensible as far as your learning targets are concerned. As
much as possible ask questions that are both relevant to your learning intents and interesting to
your students.

Expect answers

Perhaps common sense will tell you that whenever we ask a question, we always expect
a good answer. But reality has it that some teachers pose questions that are so obscure that
relevant answers are hardly drawn from the students. If, with conscious effort, we expect
relevant answers, we will always make sure that our questions are clear and that they can be
answered by our students based on their capacities. To do this, we need to first understand our
students’ developmental characteristics and their actual capacities and aptitude. Knowing all
these, we can toss up questions that match their capabilities. Finally, do not ask a question that is
only part of your script, serving merely as a cue for what you will say next. Do not ask a
question if you are not actually expecting your students to respond or if you are not really
interested in picking up on your students’ answers. Ask a question only if you truly intend to let
your students respond to it.

Push your students farther


While it is important to ask questions that match students’ capacities, it is also vital
that the questions we ask challenge them to exhaust their cognitive
resources, so that when they realize they are at the edge of their available knowledge, the
question encourages them to think of possibilities. This exercise gives students the opportunity to
discover new realms of knowledge to be explored. It also builds on their scientific
thinking of problematizing existing knowledge by subjecting it to tentativeness, so that more
argument, more exploration, and more thinking become necessary resources for learning.

What do you now know about the art of questioning? Account for your understanding of
the why’s and how’s of questioning.

How important is the art of questioning in the assessment process? Reason out some
benefits of developing the art of questioning on our assessment practices.

References:
Airasian, P. W. (2000). Assessment in the classroom: A concise approach. 2nd edition. USA:
McGraw-Hill Companies.
Chinn, C. A., O’Donnell, A. M., & Jinks, T. S. (2000). The structure of discourse in
collaborative learning. Journal of Experimental Education, 69, 77-97.
Mayer, R. E. (2002). The promise of educational psychology Volume II: Teaching for
meaningful learning. NJ: Merrill Prentice Hall.

Chapter 7
Grading Students

Chapter Objectives

1. Define grading in the educational setting of the Philippines.


2. Explain grading as a process.
3. Identify the different purposes of grading.
4. Explain the rationales for grading.
5. Reflect on the advantages and disadvantages of each grading rationale.
6. Reflect on when a rationale for grading is appropriate or not.

Lessons

1. Defining Grading
2. The Purposes of Grading
a. Feedback
b. Administrative Purposes
c. Discovering Exceptionalities
d. Motivation
3. Rationalizing Grades
a. Absolute/ Fixed Standards
b. Norms
c. Individual Growth
d. Achievement Relative to Ability
e. Achievement Relative to Effort

Lesson 1
Defining Grading

An effective and efficient way of recording and reporting evaluation results is very important
and useful to the persons concerned in the school setting. Hence, it is very important that students’
progress is recorded and reported to them, to their parents and teachers, and to school administrators,
counselors, and employers as well, because this information is used to guide and motivate
students to learn, to establish cooperation and collaboration between the home and the school, and
to certify students’ qualifications for higher educational levels and for employment. In the
educational setting, grades are used to record and report students’ progress. Grades are essential
in education because it is through them that students’ learning can be assessed, quantified, and
communicated. Every teacher needs to assign grades, which are based on assessment tools such
as tests, quizzes, projects, and so on. Through these grades, achievement of learning goals can be
communicated to students and parents, teachers, administrators, and counselors. However, it
should be remembered that grades are just one part of communicating student achievement;
therefore, they must be used with additional feedback methods.
According to Hogan (2007), grading implies (a) combining several assessments, (b)
translating the result into some type of scale that has evaluative meaning, and (c) reporting the
result in a formal way. From this definition, we can clearly say that grading is more than
quantitative values as many may see it; rather, it is a process. Grades are frequently
misunderstood as scores. However, it must be clarified that scores make up the grades. Grades
are the ones written in the students’ report cards, which are a compilation of the students’ progress
and achievement all throughout a quarter, a trimester, a semester, or a school year. Grades are
symbols used to convey the overall performance or achievement of a student and they are
frequently used for summative assessments of students. Take for instance two long exams, five
quizzes, and ten homework assignments as requirements for a quarter in a particular subject area.
To arrive at grades, a teacher must be able to combine scores from the different sets of
requirements and compute or translate them according to the assigned weights or percentages.
Then, he/she should also be able to design effective ways to communicate it to students,
parents, administrators, and others who are concerned. A less commonly used term for this
process is marking. Figure 1 shows a graphical summary of the grading process.

Figure 1. Summary of the Grading Process.

SEPARATE ASSESSMENTS: tests, quizzes, exams, projects, seatworks, worksheets, etc.
COMBINED: depending on the assigned weights/percentages for each set of requirements.
TRANSLATED: combined scores are translated into scales with evaluative meaning.
REPORTED: grades are communicated to teachers, students, parents, administrators, etc.
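
To illustrate the combine, translate, and report sequence summarized in Figure 1, here is a minimal sketch in Python. The component scores, weights, and letter-grade cut-offs are hypothetical; actual weights depend on the goals set by the teacher or the school.

```python
# Minimal sketch (hypothetical weights and scores) of the grading process:
# combine separate assessments, translate to a scale, then report.
components = {             # percent score per requirement set
    "long_exams": 85,
    "quizzes": 78,
    "homework": 92,
}
weights = {                # assigned weights must total 1.0
    "long_exams": 0.40,
    "quizzes": 0.35,
    "homework": 0.25,
}

# COMBINED: weighted average of the separate assessments.
combined = sum(components[k] * weights[k] for k in components)

# TRANSLATED: map the combined score to a scale with evaluative meaning.
def translate(score):
    if score >= 90: return "A"
    if score >= 80: return "B"
    if score >= 75: return "C"
    return "F"

# REPORTED: the grade communicated to students, parents, and administrators.
print(f"Combined score: {combined:.1f} -> Grade: {translate(combined)}")
```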

Review Questions:

1. Why is grading considered as a process?


2. Explain the different steps that make up grading.
3. Differentiate grades from scores.
4. How are grades essential in the educational context?
5. How can you use grades in different contexts?

Lesson 2
The Purposes of Grading

Grading is very important because it serves many purposes. In the educational setting, the
primary purpose of grades is to communicate to parents and students the students’ progress and
performance. For teachers, students’ grades can serve as an aid in assessing and reflecting on
whether they were effective in implementing their instructional plans, whether their instructional
goals and objectives were met, and so on. Administrators, on the other hand, can use students’
grades for a more general purpose than teachers do: they can use grades to evaluate programs,
identify and assess areas that need to be improved, and determine whether or not the curriculum
goals and objectives of the school and the state have been met by the students through their
institution. These purposes can be sorted into four major categories in the educational setting.

Feedback

Feedback plays an important role in the field of education because it provides
information about the students’ progress or lack thereof. Feedback can be addressed to three distinct
groups concerned in the teaching and learning process: parents, students, and teachers.

Feedback to Parents. Grades, especially as conveyed through report cards, provide critical
feedback to parents about their children’s progress in school. Aside from grades in the report
cards, however, feedback can also be obtained from standardized tests and teachers’ comments.
Grades also help parents to identify the strengths and weaknesses of their child.
Depending on the format of the report cards, parents may also receive feedback about their
children’s behavior, conduct, social skills, and other variables that might be included in the report
card. From a general point of view, grades basically tell parents whether their child was able to
perform satisfactorily.
However, parents are not fully aware of the several separate assessments that
students have taken and that comprise their grades. Some of these assessments can be seen by
parents, but not all. Therefore, students’ grades, communicated formally to parents, give parents
some assurance that they are seeing an overall summary of their children’s performance in school.

Feedback to Students. Grades are one way of providing feedback to students, because it
is through grades that students can recognize their strengths and weaknesses. Upon knowing
these strengths and weaknesses, students can further develop their competencies and
improve on their deficiencies. Grades also help students keep track of their progress and identify
changes in their performance.
The weight of this feedback tends to be directly proportional to the age and year level
of the students: grades are given more importance and meaning by a high school
student than by a grade one student. However, the motivation grades can
give is comparable across different ages and year levels: grade one students (the young ones) are
motivated to get high grades because of external rewards, while high school students (the older ones)
are also motivated internally to improve their competencies and performance.

Feedback to Teachers. Grades serve as relevant information to teachers. It is through
students’ grades that they can (a) assess whether students were able to acquire the
competencies they are supposed to have after instruction; (b) assess whether their instruction
plan and its implementation were effective for the students; (c) reflect on their teaching strategies
and methods; (d) reflect on possible positive and negative factors that might have affected the
grades of students before, during, and after instruction; and (e) evaluate whether the program was
indeed effective or not. Given these beneficial purposes of grades to teachers, we can say
that teaching and learning is a two-way, interrelated process: it is not only the students
who learn from the teacher, but the teacher also learns from the students. Through students’
grades, a teacher can undergo self-assessment and self-reflection in order to
improve, recognize the relative effectiveness of varying instructional
approaches across the different student groups being observed, and be flexible and effective across
different situations.

Administrative Purposes

Promotion and Retention. Grades can serve as one factor in determining whether a student will
be promoted to the next level or not. Through a student’s grades, it can be determined whether he or
she has acquired the skills and competencies required for a certain level and has achieved the
curriculum goals and objectives of the school and/or the state. In some schools, a student’s grades
are a factor taken into consideration for his/her eligibility to join extracurricular
activities (performing and theater arts, varsity, cheering squads, etc.). Grades are also used to
qualify a student to enter high school or college in some cases. Other policies may arise
depending on the school’s internal regulations. At times, failing marks may prohibit a student
from being part of the varsity team, running for office, joining school organizations, and
enjoying some of the privileges that students with passing grades get. In some colleges and
universities, students who get passing grades are given priority in enrolling for the succeeding
term, compared to students who get failing grades.

Placement of Students and Awards. Students’ grades can also be used for placement.
Grades are factors to be considered in placing students according to their competencies and
deficiencies. In this way, teaching can be more focused in terms of developing the strengths
and improving the weaknesses of students. For example, students who consistently get high,
average, or failing grades are placed in separate sections so that teachers can focus more on and
emphasize the students’ needs and demands to ensure a more productive teaching-learning
process. Another example, which is more domain specific, would be grouping students having the
same competency in a certain subject together. Through this strategy, students who have high
ability in Science can further improve their knowledge and skills by receiving more complex and
advanced topics and activities at a faster pace, while students having low ability in Science can
receive simpler and more specific topics at a slower pace (while making sure they are able to
acquire the minimum competencies required for that level as prescribed by the state curriculum).
Aside from the placement of students, grades are frequently used as a basis for academic awards.
Many, if not almost all, schools, colleges, and universities have honor rolls and dean’s lists to recognize
student achievement and performance. Grades also determine graduation awards for the overall
achievement or excellence a student has garnered throughout his/her education, whether in a single
subject or for the whole program he or she has taken.

Program Evaluation and Improvement. Through the grades of students taking a certain
program, program effectiveness can be somehow evaluated. Grades of students can be a factor
used in determining whether the program was effective or not. Through the evaluation process,
some factors that might have affected the program’s effectiveness can be identified and
minimized to improve the program further for future implementations.

Admission and Selection. Organizations external to the school also use grades as a
reference for admission. When students transfer from one school to another, their grades play a
crucial role in their admission. Most colleges and universities also use students’ grades in their
senior year of high school together with the scores they acquire on the entrance exam.
However, grades from academic records and high-stakes tests are not the sole basis for
admission; some colleges and universities also require recommendations from the school,
teachers, and/or counselors about students’ behavior and conduct. The use of grades is not
limited to the educational context; grades are also used in employment for job selection purposes
and, at times, even by insurance companies as a basis for giving discounts on insurance rates.

Discovering Exceptionalities

Diagnosing Exceptionalities. Exceptionalities, disorders, and other conditions can also
be detected with the help of grades. Although the term exceptionality is often stereotyped as
something negative, it has its positive side, such as giftedness. Grades play an essential
role in determining these exceptionalities because they are a factor to be considered in diagnosing a
person. Through grades, intelligence, ability, achievement, aptitude, and other attributes that are
quite difficult to measure can be interpreted, and proper interventions and
treatments can be given when results fall outside the established norms.

Counseling Purposes. It is partly through students’ grades that teachers know when to seek
the assistance of a counselor. For instance, if a student who normally performs well in class
suddenly incurs consecutive failing marks, then the teacher who observes this should
think and reflect about the probable reasons that caused the student’s performance to
deteriorate and consult with the counselor about what can be done to help the student. If the
situation requires skills that are beyond the capacity of the teacher, then a referral should be made.
Grades are also used in counseling alongside personality, ability, achievement, intelligence, and other
standardized tests.

Motivation

Motivation can be provided through grades; most students study hard in order to acquire
good grades, and once they get good grades, they are motivated to study harder to get higher grades.
Some students are motivated to get good grades because of their enthusiasm to join extra-
curricular activities, since some schools do not allow students to join extracurricular activities if
they have failing grades. There are numerous ways in which grades serve as motivators for
students across different contexts (family, social, personal, etc.). Thus, grades may serve as one
of the many motivators for students.

Review Questions:

1. What are the different purposes of grades in the educational context? Explain each.
2. How do grades motivate you as a student?
3. How does feedback affect your performance in school?

Activity

1. Ask 10-15 grade 1 students on how grades motivate them.


2. Ask 10-15 high school or college students on how grades motivate them.
3. Tabulate the data you were able to gather and compare how grades motivate students at
different levels.
4. Report your findings in class.

Lesson 3
Rationalizing Grades

Attainment of educational goals can be made easier if grades are accurate enough to
convey a clear view of a student’s performance and behavior. But the question is, what basis shall
we use in assigning grades? Should we grade students in relation to (a) an absolute standard, (b)
norms or the student’s peer group, (c) the individual growth of each student, (d) the ability of
each student, or (e) the effort of the student? Each of these approaches has its own
advantages and disadvantages depending on the situation, the test takers, and the test being used.
Teachers are expected to be skillful in determining when to use a certain approach and when not
to.

Absolute Standards. Using absolute standards as the basis for grades means that students’
achievement is related to a well-defined body of content or a set of skills. This basis is strongly
associated with criterion-referenced measurement. An example of a well-defined body of
content would be: “Students will be able to enumerate all the presidents of the Philippines and
the corresponding years they were in service.” An example of a set of skills would be something
like: “Students will be able to assemble and disassemble the M16 in 5 minutes.” However, this
type of grading system is somewhat questionable when different teachers make and use their
own standards for grading students’ performance, since not all teachers have the same set of
standards. Standards may therefore vary across situations and are subjective, reflecting teachers’
own philosophies, competencies, and internal beliefs about assessing students and about education
in general. Hence, this type of grading system is more appropriate when it is used in a standardized
manner, such that a school administration or the state provides the standards and makes them
uniform for all. Examples of tests for which this type of grading is appropriate are standardized
tests whose scales come from established norms and whose grades are obtained objectively.
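
As a simple illustration, the sketch below (in Python, with hypothetical cut-offs) assigns a mark purely against fixed standards, regardless of how the rest of the class performed.

```python
# Minimal sketch: absolute-standard (criterion-referenced) grading with fixed, hypothetical cut-offs.
def absolute_grade(percent_correct):
    if percent_correct >= 90: return "Outstanding"
    if percent_correct >= 75: return "Satisfactory"
    return "Needs improvement"

for score in (95, 80, 60):
    print(score, "->", absolute_grade(score))  # the same cut-offs apply to every student
```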

Norms. In this type of grading system, a student’s grade is related to the performance
of all the others who took the same test; the grade one acquires is not based on a set of
standards but on the performance of all the other individuals who took the same test. This means that
students are evaluated based on what is reasonably expected from a representative group. To further
explain this grading system, take for instance a group of 20 students: the student who got
the most number of correct answers, regardless of whether he got 60% or 90% of the items
correct, gets a high grade; and the student who got the least number of correct answers,
regardless of whether he got 10% or 90% of the items correct, would get a low grade. It can be
observed in this example that (a) 60% would warrant a high grade if it was the highest among all
the scores of the participants who took the test; and (b) 90% could possibly be graded as low
if it was the lowest among all the scores of the participants who took the test.
Therefore, this grading system is not advisable when the test is to be administered to a
heterogeneous group, because results would be extremely high or extremely low. Another
problem with this approach is the lack of teacher competency in creating a norm for a certain test,
which leads teachers to settle for absolute standards as the basis for grading students. This approach
also requires a lot of time and effort in order to create a norm for a sample. It is
also known as “grading on the curve.”
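
The sketch below (Python, hypothetical scores) illustrates grading on the curve: grades follow each student's rank within the group rather than any fixed cut-off.

```python
# Minimal sketch (hypothetical scores): norm-referenced grading by relative standing.
scores = {"Ana": 60, "Ben": 55, "Cara": 52, "Dan": 47, "Ella": 40}

# Rank students from highest to lowest score.
ranked = sorted(scores, key=scores.get, reverse=True)

# Assign grades by position in the group: the top of the list gets the highest grade.
grades = ["A", "B", "C", "D", "F"]
curve = {}
for position, student in enumerate(ranked):
    band = position * len(grades) // len(ranked)
    curve[student] = grades[band]

print(curve)  # Ana's 60% tops this group, so she gets an A; in a stronger group it could rank last
```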

Individual Growth. In this type of grading system, the level of improvement is seen as more
relevant than the level of achievement. However, this approach is somewhat
difficult to implement because growth can only be observed by relating students’ grades
prior to instruction to their grades after instruction; hence, pretests and posttests are to
be used in this type of grading system. Another issue with this type of grading system is that it
is very difficult to obtain dependable gain or growth scores even with highly refined instruments. This
system of grading disregards standards and the grades of others who took the test; rather, it uses the
amount of progress that a student was able to make to determine whether he or she will receive a high
grade or a low grade. Notice that the initial status of students is required in this type of grading
system.
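
A minimal sketch of grading on individual growth, using hypothetical pretest and posttest scores for Kyle and Lyra: the gain, not the final level, drives the grade.

```python
# Minimal sketch (hypothetical data): individual-growth grading based on pretest-posttest gain.
pretest  = {"Kyle": 30, "Lyra": 85}
posttest = {"Kyle": 70, "Lyra": 90}

gains = {s: posttest[s] - pretest[s] for s in pretest}
print(gains)  # {'Kyle': 40, 'Lyra': 5}

# Ranking by gain rewards Kyle's large improvement even though Lyra's final score is higher,
# which is the kind of conclusion this lesson later flags as potentially awkward.
ranked_by_growth = sorted(gains, key=gains.get, reverse=True)
print("Ranked by growth:", ranked_by_growth)
```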

Achievement Relative to Ability. Ability in this context refers to mental ability,
intelligence, aptitude, or some similar constructs. This type of grading is quite simple to
understand: a student with high potential in a certain domain is expected to achieve at a
superior level, and a student with limited ability should be rewarded with high grades if the
student exceeds expectations.

Achievement Relative to Effort. Similarly, this type of grading system is relative to the
effort that students exert: a student who works really diligently and responsibly,
complying with all assignments and activities, doing extra-credit projects, and so on, should receive
a high grade regardless of the quality of work he was able to produce. On the contrary, a student
who produces good work will not merit a high grade if he was not able to exert enough
effort. Notice that grades are based merely on effort and not on standards.

As mentioned earlier, each of these approaches to arriving at grades has its own
strengths and limitations.
Using absolute standards, one can focus on the achievement of students. However, this
approach can fail to state reasonable standards of performance and can therefore be subjective.
Another drawback of this approach is the difficulty in specifying clear definitions;
although this difficulty can be reduced, it can never be eliminated.
The second approach is appealing because it ensures a realism that is at times lacking in
the first approach. It avoids the problem of setting standards that are too high or too low. Also, a
situation wherein everyone fails can be prevented. However, the individual grade of a student depends
on the others, which is quite unfair. A second drawback of this kind of approach is the question of how
the teacher will choose the relevant group: will it be the students in one class, the students in the school,
the students in the state, or the students in the past ten years? A teacher must answer these questions to
have a rationale for grading achievement in relation to other students. Another
difficulty with this approach is its tendency to encourage unhealthy competition; if this
happens, students become competitors with one another, and that is not a good environment for
teaching and learning.
The last three approaches can be clustered because they have similar strengths and
weaknesses. The strength of these three is that they focus more on the individual, making the
individual define a standard for himself or herself. However, these three approaches have two drawbacks.
One is that the conclusions can seem awkward, if not detestable. For example, a student who
performed poorly but exerted effort gets a high grade, while a student who performed well
but exerted less effort gets a lower grade. Another example: Ken, with an IQ of 150, gets
a lower grade compared to Tom, with an IQ of 80, because Ken should have performed better,
while we were pleasantly amazed by Tom’s performance. Or: Kyle, starting with little knowledge
about statistics, learned and progressed a lot, while Lyra, who was already proficient and
knowledgeable in statistics, gained less progress. After the term, Kyle got a higher grade since he
was able to progress more, although it can be clearly seen that Lyra is better than him.
Conclusions of these types make people feel uncomfortable. The second
drawback is reliability. Reliability is hard to obtain when we use differences as the basis for
students’ grades. In the case of effort, it is quite hard to measure and quantify, and it is therefore
based on subjective judgments and informal observations. Hence, the grades resulting from these
three approaches, when combined with achievement, are somewhat unreliable. Table 1 presents a
summary of the advantages and disadvantages of the different rationales in grading.

Table 1. Advantages and Disadvantages of Different Rationales in Grading.

Rationale                       Advantages                           Disadvantages
Absolute Standards              - Focuses exclusively on             - Standards are opinionated
                                  achievement                        - Difficulty in getting clear
                                                                       definitions
Norms                           - Ensures realism                    - Individual grades depend on others
                                - Always clear to determine          - Choosing the relevant group
Improvement, Ability, Effort    - Concentration on the individual    - Awkward conclusions
                                                                     - Reliability

Review Questions:

1. What rationale for grading do you feel is most effective?


2. What rationale for grading is used in your school? Is it uniform across different
subjects?
3. When is each of the rationales effective to apply?
4. When is each of the rationales ineffective to apply?

Activity
Conduct a survey of 20 teachers about the following:
1. What for them is the most effective rationale for grading and why?
2. What for them is the most ineffective rationale for grading and why?
3. When is each of the rationales most effective to use?
4. Present the results in a table form.
5. Make a reflection paper about the results you gathered from twenty teachers.

References

Hogan, T. P. (2007). Educational assessment: A practical introduction. United States of America:
John Wiley & Sons.

Popham, J. W. (1998). Classroom assessment: What teachers need to know (2nd ed.). Needham
Heights, MA: Allyn & Bacon.

Brookhart, S. M. (2004). Grading. Upper Saddle River, New Jersey: Pearson Education Inc.

Oriondo, L. L. & Dallo-Antonio, E. M. (1984). Evaluating educational outcomes. Quezon City:
Rex Printing Company.

Chapter 8
Standardized Tests

Objectives

1. Characterize standardized tests.


2. Determine the classification of tests.
3. Follow procedures in constructing norms.
4. Follow standards in test administration and preparation.

Lessons

1 What are standardized tests?


2 Interpreting Test Scores Through Norm and Criterion Reference
3 Standards in Educational and Psychological testing

Lesson 1
What are Standardized Tests?

A test is a tool used to measure a sample of behavior. Why do we say “a sample” and not
the entire behavior? A test can only measure part of a behavior; it CANNOT measure the
entire behavior of a person or all of the characteristics being measured. For example, in a personality test,
you cannot test the entire personality. In the case of the NEO-PI, the subscale on extraversion can only
measure part of extraversion. As an implication, during pre-employment testing, before an
applicant is accepted, a series or battery of tests is administered to represent well the behavior
that needs to be uncovered. In school admission, the university or college requires student
applicants’ grades, entrance exam results, essays, recommendation letters, and bioprofiles to decide on
the suitability of the student. A test can never measure everything, and there are proper uses of tests.

What do you need to consider in a test?

As discussed in Chapter 3, a test should be valid and reliable and should discriminate ability
before one uses it. Validity refers to whether the test measures what it is supposed to measure.
Reliability refers to whether test scores are consistent, whether the same test is given again or
compared with another test. Discrimination is the ability of the test to determine who learned and
who did not.

What is the purpose of standardization?

The primary purposes of standardization are (1) to facilitate the development of tools and
(2) to ensure that results from a test are indeed reliable and can therefore be used to assign
values or qualities to the attributes being measured (through the established norms of the test).

What makes tests standardized?

The unique characteristics of a standardized test that differentiate it from other tests
are: (1) uniform procedures in test administration and scoring, and (2) established
norms.

Uses of Tests

1) Screen applicants for jobs and educational/training programs

2) Classification and placement of people in different contexts

3) Educational, vocational, and personal counseling and guidance

4) Retention/dismissal/promotion/rotation of students/employees in programs/jobs

5) Diagnosing and prescribing treatments in clinics/hospitals



6) Evaluating cognitive, intrapersonal, and interpersonal changes due to educational and psychotherapeutic programs

7) Conducting research on individual development over time and on the effectiveness of a new program

Classifications of Tests

Standardized VS. Non-Standardized. Standardized tests have fixed directions for scoring and administering. They can be purchased with test manuals, booklets, and answer sheets, and they were administered to the sample that constitutes the norm. A non-standardized or teacher-made test is intended for classroom assessment; it is used for classroom purposes and intends to measure behavior in line with the objectives of the course. Examples are quizzes, long tests, and exams. Can a teacher-made test become a standardized test? Yes, as long as it is valid, reliable, and has a norm.

Individual Tests VS. Group Tests. Individual tests are administered to one examinee at a time. They are used for special populations such as children and people with mental disorders. Examples are the Stanford-Binet and the WISC. Group tests are administered to many examinees at a time. Examples are classroom tests.

Speed VS. Power. A speed test consists of easy items, but the time is limited. A power test consists of a few difficult items whose difficulty has been determined in advance, and time is also limited.

Objective VS. Non-Objective/Subjective. Objective tests have fixed, objective scoring standards and commonly have right and wrong answers. Non-objective or subjective tests allow variation in responses and have no fixed answers. Examples are essays and personality tests.

Verbal VS. Non-Verbal Tests. Verbal tests consist of vocabulary and sentences; an example is a math test written with words and characters. Non-verbal tests consist of puzzles and diagrams; examples are abstract reasoning and projective tests. A performance test requires the examinee to manipulate objects.

Cognitive VS. Affective. Cognitive tests measure the processes and products of natural ability. Examples are tests of intelligence, aptitude, memory, and problem solving. An achievement test assesses what has been learned in the past. An aptitude test focuses on the future and on what the person is capable of learning; examples are the Mechanical Aptitude Test and Structural Visualization. Affective tests assess interests, personality, and attitudes, the non-cognitive aspects.
202

Lesson 2
Interpreting Test Scores Through Norm and Criterion Reference

The process of test standardization involves uniformity of procedure and an established norm.
Uniformity of procedure means that the testing conditions must be the same for all examinees. Directions are formulated, the time limit is set, and a preliminary demonstration of administering the test is prepared. In administering the test, consider the rate of speaking, tone of voice, inflection, pauses, and facial expression. Inflection here refers to the rise and fall, or modulation, of the voice. Test administration should be uniform to maintain constancy across testing groups and to minimize measurement errors.
Establishing norms for a test means obtaining the normal or average performance in the distribution of scores. A normal distribution is approximated by increasing the sample size. A norm is a standard, and it is based on a very large group of samples. Norms are reported in the manual of standardized tests. Aside from the norm, the test manual includes a description of the test, how to administer the test, reminders before testing, the dialogue of the person administering the test, and how to interpret the test scores.
A normal distribution found in the manual takes the shape of a bell curve. It shows the number of people within a range of scores. It also reports the percentage of people obtaining particular scores. The norm is used to convert a raw score into standard scores for interpretability.

What is the use of a norm?


(1) A norm is a basis for interpreting a test score
(2) You use a norm to interpret a particular score

There are two ways of interpreting scores: norm-referenced and criterion-referenced.

In criterion-referenced interpretation, a given set of standards is used, and the scores are compared against the given criterion. For example, in a 20-item test: 16-20 is high, 11-15 is average, 6-10 is poor, and 0-5 is low. In a criterion-referenced interpretation, the score is judged against particular cutoff scores. Most commonly, the grading systems in schools are criterion-referenced, where 95-100 is outstanding, 90-94 is very good, 85-89 is good, 80-84 is satisfactory, 75-79 needs improvement, and 74 and below is poor.
The interpretation for norm referencing depends on the distribution of scores of the sample. The mean and standard deviation are computed, and they approximate the middle area of the distribution. The standing of every individual in a norm-referenced interpretation is based on the mean and standard deviation of the sample. Standardized tests commonly interpret scores using norm referencing because they have standardized samples.
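To make the contrast concrete, the short sketch below (not from the text; the function name is ours, and the cutoffs simply mirror the 20-item example above) shows that a criterion-referenced interpretation is just a lookup against fixed cutoff scores, with no reference to how other examinees performed.

```python
# Illustrative sketch only: criterion-referenced interpretation of the
# 20-item test described above (cutoff bands are the ones given in the text).
def criterion_label(score: int) -> str:
    if 16 <= score <= 20:
        return "high"
    elif 11 <= score <= 15:
        return "average"
    elif 6 <= score <= 10:
        return "poor"
    return "low"          # 0-5

for s in (18, 12, 7, 3):
    print(s, criterion_label(s))   # 18 high, 12 average, 7 poor, 3 low
```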

The Normal Curve and Norms

Creating norms is usually done by test developers, psychometricians, and other practitioners in testing. When a test is created, it is administered to a large group of individuals. This group of individuals is the target sample for whom the test is intended. If the test can be used for a wide range of individuals, then a norm for each specific group possessing a given characteristic needs to be constructed. It means that a separate norm is created for males and females, and for ages 11-12, 13-14, 15-16, 17-18, and so on. There should be a norm for every kind of user of the test in order to interpret his or her position in a given distribution. A variety of norms is needed because one cannot take a norm that was made for 12-year-olds and use it for 18-year-olds; the ability of an 18-year-old is different from the ability of a 12-year-old. If a 21-year-old needs to take a test but there is no norm for 21-year-olds, then a norm for 21-year-olds has to be created. There is a need to create norms for certain groups because the groups involved differ from one another in terms of curriculum, ability, and other characteristics. For example, the majority of standardized tests used in the Philippine setting are from the West. This means that the content and norms used are based on that setting. Thus, there is a need to create norms specifically for Filipinos. Another concern in developing norms is that they expire over time. Norms created in the 1960's cannot be used to interpret the scores of test takers in 2008. Thus, norms need to be updated regularly.

In creating a norm, the goal is to come up with a distribution of scores that is typical of a normal curve. A normal distribution is asymptotic and symmetrical. Asymptotic means that the two tails of the normal curve do not touch the base, which extends to infinity, and the two sides of the normal distribution are symmetrical. The normal curve is a theoretical distribution of cases in which the mean, median, and mode are the same and in which distances from the mean can be measured in standardized units such as standard deviation units or z-scores. The z-scores are standardized values transformed from the raw score distribution. Six standard-score units are typically marked off on the normal curve. The z-score commonly ranges from -3 to +3, with a mean of 0 and a standard deviation of 1.
204

Steps in creating a norm

Suppose that a general ability test with 100 items was constructed and pilot tested on 25 participants. The goal is to construct a norm to interpret the scores of future test takers. (In practice, 25 respondents are not enough to create a norm; the small sample is used here only for illustration.)

96 74 64 50 76
83 80 92 85 91
59 68 76 75 69
64 87 71 81 83
73 67 68 70 75

1. Compute the range:

$R = (\text{highest score} - \text{lowest score}) + 1 = (96 - 50) + 1 = 47$

2. Compute the interval size (i):

$i = \frac{R}{10} = \frac{47}{10} = 4.7$ (5 will be the interval size)

3. Start the class intervals with a score that is divisible by your interval size. The lowest score, which is 50, is divisible by 5 (the interval size), so the lowest class interval can start at 50.

4. Create the Frequency Distribution Table (FDT)

Class interval (ci)   Tally   Frequency (f)   Relative frequency (rf)   Cumulative frequency (cf)   Cumulative percentage (cP)
95-99 | 1 4% 25 100
90-94 || 2 8% 24 96
85-89 || 2 8% 22 88
80-84 |||| 4 16% 20 80
75-79 |||| 4 16% 16 64
70-74 |||| 4 16% 12 48
65-69 |||| 4 16% 8 32
60-64 || 2 8% 4 16
55-59 | 1 4% 2 8
50-54 | 1 4% 1 4
Σf=25

Notes on the table: the class intervals start at a value divisible by the interval size; the frequency (f) is the count of scores falling in each class interval, and the frequencies should sum to Σf = N = 25; the relative frequency is rf = (f/N) x 100; the cumulative frequency (cf) is built from the lowest class upward by copying the lowest f and then adding each succeeding f; and the cumulative percentage is cP = (cf/N) x 100.

The frequency (f) and relative frequency (rf) indicate how many participants scored within a class interval. The cumulative percentage (cP) indicates the point in a distribution that has a given percent of the cases at or below it. In the example, an examinee who scored 87 falls in the 85-89 interval with a cP of 88, which means that about 88% of the participants scored at or below his score and about 12% scored above it.
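The frequency distribution table above can also be generated with a short script. The sketch below is only an illustration (the variable names are ours, not the book's); it rebuilds f, rf, cf, and cP from the 25 pilot scores using an interval size of 5.

```python
# Illustrative reconstruction of the frequency distribution table.
scores = [96, 74, 64, 50, 76, 83, 80, 92, 85, 91,
          59, 68, 76, 75, 69, 64, 87, 71, 81, 83,
          73, 67, 68, 70, 75]

i = 5                      # interval size from Step 2 (R/10, rounded up)
start = 50                 # lowest class limit, divisible by the interval size
n = len(scores)

cumulative = 0
rows = []
for lower in range(start, max(scores) + 1, i):
    upper = lower + i - 1
    f = sum(lower <= s <= upper for s in scores)      # frequency
    cumulative += f                                   # cumulative frequency
    rows.append((lower, upper, f, 100 * f / n, cumulative, 100 * cumulative / n))

for lower, upper, f, rf, cf, cp in reversed(rows):    # highest interval first
    print(f"{lower}-{upper}  f={f}  rf={rf:.0f}%  cf={cf}  cP={cp:.0f}%")
```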
[Figure: histogram of the frequency distribution, plotted against the class-interval midpoints.]

When a histogram is created for the data set, it typifies a normal distribution. To
determine if a distribution of scores will approximate a normal curve, there are indices to be
assessed:

1. The mean and median should have approximately close values.


2. The computed skewness (sk) is close to zero
3. The computed kurtosis (K) is close to 0.256

Computation of the mean and median:

$\bar{X} = \frac{\Sigma X}{N} = \frac{1877}{25} = 75.08$

$C_{50} = cb + \left(\frac{N(.50) - cf}{f}\right) i = 74.5 + \left(\frac{25(.50) - 12}{4}\right) 5 = 75.13$

Fifty percent of N = 25 is 12.5. Given this value, select from the cumulative frequency (cf) column of the frequency distribution table the entry that is closest to 12.5 without exceeding it. This value is 12, which is then used as cf in the formula. The f used is 4 because, given a cf of 12, the class with a frequency of 4 is the one that contains the 12.5th case; the value 4 is the frequency of the class just above the cf of 12. The i value is the interval size, which is 5. To determine cb, the class boundary, get the upper limit of the class interval whose cf is 12. This upper limit is 74 (70 is the lower limit). The boundary between 74 and the next lower limit, 75, is 74.5; therefore 74.5 is used as cb.

The values of the mean (75.08) and the median (75.13) are close. It can be assumed that the distribution is approximately normal.
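As a quick check, the mean and the grouped-data median can be computed in a few lines (an illustrative sketch; the class boundary, cumulative frequency, and class frequency are the values identified above).

```python
# Illustrative check of the mean and the grouped-data median (C50).
scores = [96, 74, 64, 50, 76, 83, 80, 92, 85, 91,
          59, 68, 76, 75, 69, 64, 87, 71, 81, 83,
          73, 67, 68, 70, 75]
n = len(scores)                                   # 25

mean = sum(scores) / n                            # 1877 / 25 = 75.08

# Median class 75-79 contains the 12.5th case: cb = 74.5, cf below = 12, f = 4, i = 5.
cb, cf_below, f, i = 74.5, 12, 4, 5
median = cb + ((0.5 * n - cf_below) / f) * i      # 74.5 + (0.5/4)*5 = 75.125

print(mean, median)   # 75.08 75.125 (the text rounds the median to 75.13)
```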
207

Estimating Skewness. Skewness refers to the tail of a distribution. If the two tails are balanced, then the distribution is said to be normally distributed, with a skewness of 0. A distribution which is not normal is said to be skewed. If the longer tail extends to the right, the skewness is positive; if the longer tail extends to the left, the distribution is negatively skewed.

Notice that in a skewed distribution, the mean and median are not equal. In a positively skewed distribution, the mean is pulled by the extreme scores on the right and has a higher value than the median $(\bar{X} > C_{50})$. In a negatively skewed curve, the mean is pulled by the extreme scores on the left side, so the median has the higher value $(\bar{X} < C_{50})$.

Formula to determine Skewness:

$sk = \frac{3(\bar{X} - C_{50})}{sd}$

where sd is the standard deviation, $\bar{X}$ is the mean, and $C_{50}$ is the median. In the previous section the mean and median were already computed, with values of 75.08 and 75.13, respectively. To determine the value of the standard deviation, the formula below is used:

$sd = \sqrt{\dfrac{\Sigma X^2 - \frac{(\Sigma X)^2}{N}}{N - 1}} = \sqrt{\dfrac{143713 - \frac{(1877)^2}{25}}{25 - 1}} = 10.78$

where $\Sigma X$ is the sum of all scores, $\Sigma X^2$ is the sum of squares, and N is the sample size. $\Sigma X^2$ is obtained by squaring each score and then summing the squares, which gives 143713 for the given data. Substituting the values in the formula:

$sk = \frac{3(\bar{X} - C_{50})}{sd} = \frac{3(75.08 - 75.13)}{10.78} = -0.014$

The value of sk is almost 0, which indicates that the distribution is approximately normal.
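The standard deviation and the skewness index can be verified with a short sketch (illustrative only; the grouped median of 75.125 from the earlier computation is plugged in directly).

```python
# Illustrative check of sd and sk = 3*(mean - median)/sd for the 25 pilot scores.
import math

scores = [96, 74, 64, 50, 76, 83, 80, 92, 85, 91,
          59, 68, 76, 75, 69, 64, 87, 71, 81, 83,
          73, 67, 68, 70, 75]
n = len(scores)
mean = sum(scores) / n                                      # 75.08
sum_sq = sum(s * s for s in scores)                         # sum of squares = 143713

sd = math.sqrt((sum_sq - sum(scores) ** 2 / n) / (n - 1))   # about 10.78
sk = 3 * (mean - 75.125) / sd                               # about -0.01, essentially zero
print(round(sd, 2), round(sk, 3))
```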
208

Estimating Kurtosis. Kurtosis refers to the peakedness of the curve. If a curve is peaked and the tails are more elevated, the curve is leptokurtic; if the curve is flattened, then it is said to be platykurtic. A normal distribution is mesokurtic.

Formula for Kurtosis:

$Kurtosis = \frac{QD}{P_{90} - P_{10}}$

where QD is the quartile deviation, $QD = \frac{Q_3 - Q_1}{2}$, $P_{90}$ is the 90th percentile, and $P_{10}$ is the 10th percentile. The formula used to determine the median can also be used to determine percentile points (P). $Q_3$ is equivalent to $P_{75}$ and $Q_1$ is equivalent to $P_{25}$. Four percentile estimates are needed to determine kurtosis: $P_{75}$, $P_{25}$, $P_{90}$, and $P_{10}$.

$P_{75} = cb + \left(\frac{N(.75) - cf}{f}\right) i = 79.5 + \left(\frac{25(.75) - 16}{4}\right) 5 = 82.94$

$P_{25} = cb + \left(\frac{N(.25) - cf}{f}\right) i = 64.5 + \left(\frac{25(.25) - 4}{4}\right) 5 = 67.31$

$P_{10} = cb + \left(\frac{N(.10) - cf}{f}\right) i = 59.5 + \left(\frac{25(.10) - 2}{2}\right) 5 = 60.75$

$P_{90} = cb + \left(\frac{N(.90) - cf}{f}\right) i = 89.5 + \left(\frac{25(.90) - 22}{2}\right) 5 = 90.75$

You can now compute the quartile deviation (QD):

$QD = \frac{Q_3 - Q_1}{2} = \frac{82.94 - 67.31}{2} = 7.81$

You can now compute the kurtosis:

$Kurtosis = \frac{QD}{P_{90} - P_{10}} = \frac{7.81}{90.75 - 60.75} = 0.26$

The distribution approximates the normal curve, since the kurtosis value of about 0.26 is close to the criterion value given earlier.
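The percentile points and the kurtosis index can likewise be checked with the sketch below (illustrative only; the class boundaries, frequencies, and cumulative frequencies come from the frequency distribution table constructed earlier).

```python
# Illustrative check of P10, P25, P75, P90, QD, and Ku = QD / (P90 - P10).
def grouped_percentile(p, table, n, i=5):
    """table rows: (lower class boundary, f, cumulative f below the class)."""
    target = n * p
    for cb, f, cf_below in table:
        if cf_below + f >= target:          # class that contains the target case
            return cb + ((target - cf_below) / f) * i
    raise ValueError("percentile beyond the distribution")

# (lower boundary, f, cf below the class), from 50-54 up to 95-99
table = [(49.5, 1, 0), (54.5, 1, 1), (59.5, 2, 2), (64.5, 4, 4), (69.5, 4, 8),
         (74.5, 4, 12), (79.5, 4, 16), (84.5, 2, 20), (89.5, 2, 22), (94.5, 1, 24)]

p75 = grouped_percentile(0.75, table, 25)   # about 82.94
p25 = grouped_percentile(0.25, table, 25)   # about 67.31
p90 = grouped_percentile(0.90, table, 25)   # about 90.75
p10 = grouped_percentile(0.10, table, 25)   # about 60.75
qd = (p75 - p25) / 2                        # about 7.81
print(round(qd / (p90 - p10), 3))           # about 0.26, close to the mesokurtic value
```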

Interpreting areas in the norm

How many participants are there below a score of 94 in the test?

A score of 94 corresponds approximately to a percentile rank of 96. Taking 96% of the total N of 25 gives the number of participants: 25(.96) = 24. This means that there are 24 cases below a score of 94.

What is the standard score corresponding to a score of 94? Locate this score in the normal
curve.
To convert a raw score to a standard z-score, the formula $z = \frac{X - \bar{X}}{sd}$ is used, where X is the given raw score. Using the given data set, $\bar{X} = 75.08$ and sd = 10.78:

$z = \frac{94 - 75.08}{10.78} = 1.76$

[Figure: the normal curve, with the point at a z-score of 1.76 marking the location of a raw score of 94 in the distribution.]

Other standard Scales in a normal distribution



Notice that the z-score has a mean of 0 and standard deviation of 1. A T score has a mean of 50
and a standard deviation of 10. For the other scales:

Mean Standard Deviation


CEEB score 500 100
ACT 15 5
Stanine 5 2

Convert a raw score of 94 into T score, CEEB, ACT, and stanine equivalents. Given the z value of 1.76 for a raw score of 94, just multiply z by the standard deviation of the target scale and then add the scale's mean.

T score = z (10) + 50 T score = 1.76 (10) + 50 T score = 67.6


CEEB = z (100) + 500 CEEB = 1.76 (100) + 500 CEEB = 676
ACT = z (5) + 15 ACT = 1.76 (5) + 15 ACT = 23.8
Stanine = z (2) + 5 Stanine = 1.76 (2) + 5 Stanine = 8.52

A raw score of 94 has an equivalent T score of 67.6, a CEEB score of 676, an ACT score of 23.8, and a stanine of 8.52 (stanines are reported as whole numbers from 1 to 9, so this corresponds to a stanine of 9).

Once a raw score is converted into a standard score, it can be interpreted based on its position in the normal curve. For example, a raw score of 94 is said to be above average because its location falls above the mean.
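These conversions are simple linear transformations, as the sketch below illustrates (the function is ours, not part of any testing package; the means and standard deviations are the ones tabled above).

```python
# Illustrative conversion of a raw score to z, T, CEEB, ACT, and stanine scores.
def to_standard_scores(raw, mean=75.08, sd=10.78):
    z = (raw - mean) / sd
    return {
        "z": z,
        "T": z * 10 + 50,          # mean 50, sd 10
        "CEEB": z * 100 + 500,     # mean 500, sd 100
        "ACT": z * 5 + 15,         # mean 15, sd 5 (the values used in the text)
        "stanine": z * 2 + 5,      # mean 5, sd 2; reported as an integer from 1 to 9
    }

for scale, value in to_standard_scores(94).items():
    print(scale, round(value, 2))
# Prints z 1.76, T 67.55, CEEB 675.51, ACT 23.78, stanine 8.51 -- close to the
# text's 67.6, 676, 23.8, and 8.52, which use the z value rounded to 1.76.
```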

Areas of the Normal Curve



Since the normal distribution is symmetrical, it has constant areas. When cutoffs are made using z-scores, the areas are as follows.

The areas show that from the mean to a z score of 1, the area covered is 34.13%, which is also 1 standard deviation away from the mean. From a standard score of -1 to +1, a total area of 68.26% (34.13% + 34.13%) is covered. From -2 to +2, a total area of 95.44% is covered in the curve. From -3 to +3, a total area of 99.72% is covered. The remaining areas of the normal curve are 0.13% on each side. The approximate areas of the normal curve for every z-score are found in Appendix C of the book.
For example, for a given raw score of 94, what is the area away from the mean? Given the z score of 1.76 for a raw score of 94, looking up the value of 1.76 in Appendix C (first column, z score) gives a value of .4608, which is the area away from the mean. To illustrate, it means that the area from the mean "0" to a z score of 1.76 occupies 46.08% of the normal distribution.

[Figure: normal curve with the area from z = 0 to z = 1.76 shaded; the shaded area occupies 46.08% of the distribution, and the remaining area above z = 1.76 is 3.92%.]

How many cases are within this 46.08% area of the distribution?
Multiplying the area by N (.4608 x 25) gives about 12 participants.

What is the area above a z score of 1.76?


To determine the area above 1.76, one way is to look at Appendix C under the column for the area in the smaller proportion. Locating the z value of 1.76, the corresponding area in the smaller proportion is .0392. This means that the area remaining in the right tail of the normal curve is 3.92%.
Another solution is to subtract the shaded area from .5, which is half of the distribution: .5 - .4608 = .0392.
A third solution is to subtract the shaded area from the entire area of the curve (1 - .4608 = .5392) and then subtract the remaining half of the distribution: .5392 - .5 = .0392.
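Instead of reading Appendix C, the same areas can be obtained from the cumulative normal distribution, as in the sketch below (Python's statistics.NormalDist is used here only as a stand-in for the printed table).

```python
# Illustrative computation of normal-curve areas without a printed table.
from statistics import NormalDist

z = 1.76
area_from_mean = NormalDist().cdf(z) - 0.5      # about 0.4608 (mean "0" to z = 1.76)
area_above = 1 - NormalDist().cdf(z)            # about 0.0392 (the smaller proportion)

print(round(area_from_mean, 4), round(area_above, 4))
print(round(area_from_mean * 25))               # about 12 cases within that area (N = 25)
```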

1) How many cases are within the 68.26% area of the normal distribution?
Multiply N = 25 by .6826: 25 x .6826 gives about 17 cases. The remaining 25 - 17 = 8 cases fall outside this area.

[Figure: normal curve showing that about 17 of the 25 people fall within the central 68.26% of the distribution.]

2) Given a score of 87 and another score of 73, how many people are between the two scores?
Convert 87 and 73 into z scores ($\bar{X}$ = 75.08, sd = 10.78). A score of 87 corresponds to a z score of 1.11, and a score of 73 corresponds to a z score of -0.19. A z score of 1.11 is located on the right side of the curve, above the mean, and a z score of -0.19 is on the left side, below the mean, because of the negative sign. The area away from the mean can be looked up for each z score, and the two areas are added to determine the proportion between the two scores. This proportion is then multiplied by N = 25 to determine the number of cases between the two scores.

(.3665 + .0753) = .4418 x 25 ≈ 11 cases
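The number of cases between any two raw scores can be computed the same way, as the sketch below shows (illustrative only; small differences from the table-based answer come from rounding z to two decimals).

```python
# Illustrative count of cases between raw scores of 73 and 87 (N = 25).
from statistics import NormalDist

mean, sd, n = 75.08, 10.78, 25
dist = NormalDist(mean, sd)

proportion = dist.cdf(87) - dist.cdf(73)             # about 0.4421 of the distribution
print(round(proportion, 4), round(proportion * n))   # about 11 cases
```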

[Figure: normal curve with the areas .0753 (from z = -0.19 to 0) and .3665 (from z = 0 to 1.11) shaded.]
213

Summary of the Distinction between Criterion and Norm Reference

Dimension: Purpose
Criterion-Referenced Tests: To determine whether each student has achieved specific skills or concepts, and to find out how much students know before instruction begins and after it has finished.
Norm-Referenced Tests: To rank each student with respect to the achievement of others in broad areas of knowledge, and to discriminate between high and low achievers.

Dimension: Content
Criterion-Referenced Tests: Measures specific skills which make up a designated curriculum. These skills are identified by teachers and curriculum experts. Each skill is expressed as an instructional objective.
Norm-Referenced Tests: Measures broad skill areas sampled from a variety of textbooks, syllabi, and the judgments of curriculum experts.

Dimension: Item Characteristics
Criterion-Referenced Tests: Each skill is tested by at least four items in order to obtain an adequate sample of student performance and to minimize the effect of guessing. The items which test any given skill are parallel in difficulty.
Norm-Referenced Tests: Each skill is usually tested by fewer than four items. Items vary in difficulty. Items are selected that discriminate between high and low achievers.

Dimension: Score Interpretation
Criterion-Referenced Tests: Each individual is compared with a preset standard for acceptable achievement; the performance of other examinees is irrelevant. A student's score is usually expressed as a percentage. Student achievement is reported for individual skills.
Norm-Referenced Tests: Each individual is compared with other examinees and assigned a score, usually expressed as a percentile, a grade equivalent score, or a stanine. Student achievement is reported for broad skill areas, although some norm-referenced tests do report student achievement for individual skills.
214

Exercise
(True False) 1. The mean of a score is equivalent to zero in a standard z score.
(True False) 2. The mean and the median are equivalent to 0 in a normal curve.
(True False) 3. 68% of the normal distribution lies within 2 standard deviations of the mean.
(True False) 4. The entire area of the normal distribution is 100%.
(True False) 5. The area in percentage from -3 to -2 of the normal distribution is 86.26%.
(True False) 6. The area in each extreme tail of the normal distribution is 0.13%.
(True False) 7. The area of the normal curve from +2 to -1 is 95.44%.
(True False) 8. The area from -2 to +1 is equivalent to the area from +2 to +1.
(True False) 9. The mode is found at zero in a normal distribution.
215

Lesson 3
Standards in Educational and Psychological testing

Controlling the Use of Tests

There is a need to control the use of tests because of the issue of leakage. When leakage happens, it becomes difficult to determine abilities accurately. To control the use of tests, two considerations are ensured: a qualified examiner and proper procedures in test administration. A person can be a qualified examiner provided that he or she undergoes training in administering a particular test. The psychometrician is the one responsible for the psychometric properties and the selection of tests. The psychometrician also trains the staff on how to administer standardized tests properly.
A qualified examiner needs to follow instructions precisely by undergoing training or orientation to develop the skill of administering a test. The examiner needs to follow the test manual precisely. If the examiner deviates largely from the instructions, then it defeats the purpose of standardization. One of the distinct qualities of standardized measures is uniformity of administration. Moreover, lack of precision in following the instructions in the administration of the test can affect the results of the test.
The examiner should have a thorough familiarity with the test's instructions. Examiners should at least memorize their script, even the part where they introduce themselves to the examinees.
Careful control of the testing condition is also important; this concerns the environment of the testing rooms where the exam is taken. If there are many groups who will take the exam, the conditions should be the same for all. These include the lighting, temperature, noise, ventilation, and facilities. The condition of the testing room can affect the test-taking process.
Proper checking procedures should also be taken into consideration. It should be decided whether the test will be checked by scanning via computer or checked manually. A second round of checking should also be done to verify that the checking was done accurately.
There should also be proper interpretation of results. Some trained examiners have the skills to make a psychological profile out of the battery of tests administered. The psychometrician is qualified to write a narrative integrating all test results. In some cases, staff members are trained to write psychological profiles, especially when there are occasional test takers.

Security of the Test Content

Test content should be restricted in order to forestall deliberate efforts to fake scores. The questionnaires can be accessed only by the psychometricians. The staff, superiors, or anybody else are not allowed to have access to the tests. To avoid leakage and familiarity, the psychometrician can use different sets of standardized tests which measure the same characteristics for different groups of test takers.
Test results are confidential. The examiner is not allowed to show the results of the exam to anybody other than the test taker and the persons who will use them for decision making. Test results are kept where they are accessible only to the psychometrician and qualified personnel.
The nature of the test should be communicated effectively to the test takers. It is important to dispel any mystery and misconception regarding the test. It should be clarified to the test takers what the test assesses and what decisions it will be used for. The procedures of the test can be explained to test takers in case they are concerned.

It is essential for them to know that the test is reliable and valid. Moreover, the examiner should also dispel the anxiety of the test takers to ensure that they will perform to the best of their ability. After the test, feedback on the results should be communicated to the test takers. It is the right of the test takers to know the results of the test they took. The psychometrician is responsible for keeping all the records of the results in case the test takers ask for them.

Test Administration

Before the test proper, the examiner should prepare for the test administration. The preparation involves memorizing the script and becoming familiar with the instructions and procedures. The examiner should memorize the exact verbal instructions, especially the introduction. However, there are some standardized tests that do not require the examiner to memorize the instructions and procedures; some tests permit the examiner to read the instructions and procedures from the manual.
In terms of preparing the test materials, it is advisable that the examiner prepares a day before the testing day. The examiner counts the test booklets, answer sheets, and pencils, and prepares the sign boards, the stopwatch, other materials, and the room itself. The room reservation should have been made one month before the testing day. The testing schedules are all prearranged, and the room conditions are fixed, including the ventilation, air-conditioning, and chairs.
Thorough familiarity with the specific testing procedure is also important. This is done by checking the names of the test takers; the pictures on the test permits should match the examinees' faces. The materials provided for administering the test, such as the stopwatch, should be checked to see whether they are working properly.
Advance briefing of the proctors is also done through orientation and training on how to administer the test. During the test, the examiner is responsible for reading the instructions carefully, taking charge of timing, and taking charge of the group taking the exam. The examiner should also prevent test takers from cheating, and should check whether the number of test takers corresponds with the number of test booklets after the session. The examiner should also make sure that test takers follow instructions, such as shading the circles when they are asked to do so. For questions that cannot be answered by the proctor, there is a testing manager nearby who can be consulted.
For the testing condition, the environment should not be noisy. The examiner should select good and suitable testing rooms that provide a good testing environment for the test takers. The area or place where the test is administered should be restricted, and noise in the place should be regulated. The temperature should be kept the same in all rooms. The room should be free of noise, the lights should be bright enough, seating facilities should be good, and other factors that can negatively affect the test takers as they take the exam should be controlled. There should also be special steps to prevent distractions, such as putting signs outside saying "examination going on"; the examiner can also lock the door or ask assistants outside the room to tell people that a test is going on in that area. Subtle testing conditions, such as the tables, chairs, type of answer sheet, and paper-and-pencil versus computer administration, may also affect performance on ability and personality tests.
217

Introducing the Test to test takers

The test administrator should establish rapport with the test takers. Rapport refers to the examiner's efforts to arouse the test takers' interest in the test, elicit their cooperation, and encourage them to respond appropriately. For ability tests, encourage test takers to exert their best effort to perform well. For personality inventories, tell test takers to be frank and honest in their responses to the questions. For projective tests, inform test takers to report fully the associations evoked by the stimuli without censoring or editing content. In general, the examiner motivates test takers to follow instructions carefully.

Testing Different Groups

For preschool children, the test administrator has to be friendly, cheerful, and relaxed, keep the testing time short, use interesting tasks, and be flexible in scoring. Children are shown examples of how to answer each type of test item.
For grade school students, the test administrator should appeal to their competitive side and their desire to do well.
The educationally disadvantaged may not be motivated in the same way as the usual test takers, and the examiner should adapt to their needs. Nonverbal tests are used for deaf examinees and those who are not able to read and write. Oral tests should be given to examinees who have difficulty in writing.
For the emotionally disturbed, test administrators should be sensitive, when interpreting scores, to difficulties the test takers might have. Testing should occur when these examinees are in the proper condition.
For adults, test administrators should explain the purpose of the test and convince the test takers that it is in their own interest.
Examiner variables such as age, sex, ethnicity, professional/socio-economic status, training, experience, personality characteristics, and appearance affect the test takers. Situational variables such as an unfamiliar or stressful environment, activities before the test, emotional disturbance, and fatigue also affect the test takers.

Examples of Standardized Tests

Intelligence tests

IPAT Culture-Fair Test of “g”

The Culture-Fair Test of “g” is a measure of individual intelligence in a manner designed


to reduce the influence of verbal fluency, cultural climate, and educational level. It can be
administered to individuals or to groups. It is a non-verbal test and requires only that examinees be able to perceive relationships among shapes and figures. It has subtests including series, classification, matrices, and conditions. There are also three scales of this test. Scale 1 is intended for children aged 4-8 years and for older mentally handicapped people, while Scales 2 and 3 are administered to groups. Reliability was obtained, and all coefficients are quite high and have been evaluated across large and widely diverse samples. The difference in level of reliability between the short form and the full test (Forms A and B) is sufficiently large to warrant administration of the full test. Scale 2 reliability coefficients are .80-.87 for the full test
and .67 to .76 for the short form. Scale 3 on the other hand has a reliability coefficient of .82-.85
for the full test and .69 to .74 for the short form. The validity evidence gathered was construct and concurrent validity. For Scale 2, construct validity was reported at .85 for the full test and .81 for the short form, while concurrent validity was .77 for the full test and .70 for the short form. For Scale 3, construct validity was reported at .92 for the full test and .85 for the short form, while concurrent validity
was reported at .65 for the full test and .66 for the short form. Standardization was done for both scales. For Scale 2, 4,328 males and females from varied regions of the US and Britain were included, and for Scale 3, 3,140 American high school students from first to fourth year, together with young adults, participated.

Otis Lennon Mental Ability Test

This test was developed by Arthur Otis and Roger Lennon and was published by Harcourt, Brace and World, Inc. in New York in 1957. It was designed to provide a comprehensive assessment of the general mental ability of students in American schools. It
is also developed to measure the student’s facility in reasoning and in dealing abstractly with
verbal, symbolic, and figural test content. The content sampling includes a broad range of mental abilities.
It is important to take note that it does not intend to measure the innate mental ability of the
students. There are 6 levels of Otis Lennon Mental Ability Test to ensure the comprehensive and
efficient measure of the mental ability available or already developed among students in Grade
K-12. Primary Level I is intended for students in the last half of kindergarten, Primary Level II for the first half of grade 2, Elementary I for the last half of grade 2 through grade 3, Elementary II for grades 4-6, Intermediate for grades 7 to 9, and Advanced for grades 10-12. The
norm was obtained from 200,000 students from 117 school systems in the 50 states who participated in the national standardization program. There were 12,000 pupils from grades 1-12, while 6,000 were from kindergarten. For reliability, the split-half method was used, with computed coefficients ranging from .93 (Elementary I) to .96 (Intermediate). The Kuder-Richardson formula 20 (KR-20) also yielded reliability coefficients above .93 (Elementary I) up to .96 (Intermediate), and alternate-form reliability coefficients range from .89 (Elementary II) to .94 (Intermediate). As for validity, correlations with school grades and with scores on achievement tests were computed. Moreover, the relationships between the OLMAT and other accepted mental ability and aptitude tests were computed.

Otis Lennon School Ability Test

This test was developed by Arthur Otis and Roger Lennon and was published by Harcourt Brace Jovanovich, Inc. in New York in 1979. It was developed to give an accurate and efficient measure of the abilities needed to attain the desired cognitive outcomes of formal education. It intends to measure general mental ability, or Spearman's "g". This model was modified by Vernon, who postulated two major factors or components of "g": the verbal-educational and the practical-mechanical. This test, however, focuses on the verbal-educational factor through a variety of tasks that call for the application of several processes to verbal, quantitative, and pictorial content. The OLSAT is organized into five levels: Primary Level I for grade 1 students, Primary Level II for grades 2 and 3, Elementary for grades 4 and 5, Intermediate for grades 6-8, and Advanced for grades 9 through 12. Each level is designed to obtain reliable and efficient measurement for most of the students for whom it is intended.
For each level, there are two parallel forms of the test; the Forms R and S were developed. Items
in these two forms are balanced in terms of content, difficulty and discriminatory power. These
two forms also obtained comparable results. A norm composed of 130,000 students in 70 school systems enrolled in grades 1-12 in American schools was used for standardization. For the reliability of the test, the Kuder-Richardson method yielded reliability coefficients of .91 to .95. Test-retest reliability was also computed and obtained coefficients of .93 to .95. Lastly, the standard error of measurement was computed, wherein 2/3 of scores fell within +/- 1 standard error of measurement of the "true scores" and 95% fell within +/- 2 standard errors of measurement of the "true scores". For validity, the OLSAT was correlated with teachers' grades, obtaining coefficients of .40-.60 with a median of .49. The OLSAT was also correlated with achievement test scores.

Raven’s Progressive Matrices

This test was originally developed by Dr. John C. Raven and was published by the U.S.
Distributor: The Psychological Corporation in 1936. It is a multiple-choice test of abstract reasoning. It was designed to measure the ability of a person to form perceptual relations. Moreover, it intends to measure a person's ability to reason by analogy independently of language and formal schooling. This test is a measure of Spearman's g. It consists of 60 items arranged in five sets (A, B, C, D, & E) of 12 items each. Each item contains a
figure with a missing piece. There are either six (sets A & B) or eight (sets C through E)
alternative pieces to complete the figure, only one of which is correct. Each set involves a
different principle or "theme" for obtaining the missing piece, and within a set the items are
roughly arranged in increasing order of difficulty. The raw score is converted to a percentile rank
through the use of the appropriate norms. This test is intended for people aged 6 years up to adulthood. The matrices are offered in three different forms for participants of different ability levels, which include the Standard Progressive Matrices, the Coloured Progressive Matrices, and the
Advanced Progressive Matrices. The Standard Progressive Matrices were the original form of
the matrices and were first published in the year 1938. This test comprises five sets (A to E) of
12 items each with items within a set becoming increasingly difficult. This requires ever greater
cognitive capacity in order to encode and analyze information. All of the items are presented in
black ink on a white background. There is also the Coloured Progressive Matrices, designed for younger children, the elderly, and people with moderate or severe learning difficulties. This form consists of sets A and B from the standard matrices, with a further set of 12 items, set Ab, inserted between the two. Most of the items are presented on a colored background so that the test will appear visually stimulating for participants. On the other hand, the very last few items in set B are presented as black-on-white so that, if participants exceed the tester's expectations, transition to sets C, D, and E of the standard matrices is eased. Another form is the Advanced Progressive Matrices, which contains 48 items, presented as one set of 12 (set I) and another of 36 (set II). Items here are also presented in black ink on a white background and become increasingly difficult as progress is made through each set. The items in this form are appropriate for adults and adolescents of above average intelligence. The last two forms of the matrices were published in 1998. In terms of establishing the norms, the standard samples included British children between the ages of 6 and 16, Irish children between the ages of 6 and 12, and military and civilian subjects between the ages of 20 and 65. Other samples came from Canada, the United States, and Germany. The two main factors of Raven's Progressive

Matrices are the two main components of general intelligence (originally identified by
Spearman): Eductive ability (the ability to think clearly and make sense of complexity) and
reproductive ability (the ability to store and reproduce information). To determine reliability, the split-half method and KR-20 yielded estimates ranging from .60 to .98, with a median of .90. Test-retest correlations were also obtained, with coefficients ranging from a low of .46 for an eleven-year interval to a high of .97 for a two-day interval. The median test-retest value is approximately .82. Raven provided test-retest coefficients for several age groups: .88 (13 yrs. plus), .93 (under 30 yrs.), .88 (30-39 yrs.), .87 (40-49 yrs.), .83 (50 yrs. and over). For test validity, Spearman considered the SPM to be the best measure of g. In evaluations using the factor analytic methods that were used to define g initially, the SPM comes as close to measuring it as one might expect. The majority of studies which have factor analyzed the SPM along with other cognitive measures in Western cultures report loadings higher than .75 on a general factor. Moreover, concurrent validity coefficients between the SPM and the Stanford-Binet and Wechsler scales range between .54 and .88, with the majority in the .70s and .80s.

SRA Verbal

This is a general ability test which measures the individual's overall adaptability and flexibility in comprehending and following instructions and in adjusting to alternating types of problems. It is designed for use in both school and industry. It has two forms, A and B, that can be used at all educational levels from junior high school to college and at all employee levels from unskilled laborers to middle management. However, it is intended only for persons familiar with the English language. To determine the general ability of persons who speak a foreign language or who are illiterate, a non-verbal or pictorial test should be used. The items in this test are of two types: vocabulary (linguistic) and arithmetic reasoning (quantitative). The test is intended for 12 to 17 year olds. Reliability was determined, and the reported coefficients are in the high .70s for all the scores: linguistic, quantitative, and total. The means were also found to be very similar. For the validity of the test, the SRA was correlated with other tests, particularly the HS Placement Test (r = .60) and the Army General Classification Test (r = .82).

Watson Glaser Critical Thinking Appraisal

This test was designed to measure a person's critical thinking. It is a series of exercises which require the application of some of the important abilities involved in thinking critically. It includes problems, statements, arguments, and interpretations of data similar to those which a citizen in a democracy might encounter in daily life. It has two forms, the Ym and
the Zm, which also consist of 5 subtests each. These subtests were designed to measure different and interdependent aspects of critical thinking. There are 100 items, and it is not a test of speed but a test of power. The five subtests are inference, recognizing assumptions, deduction,
interpretation, and evaluation of arguments. Inference consists of 10 items in which students are to display the ability to discriminate among degrees of truth or falsity of inferences drawn from given data. Recognizing assumptions (16 items), on the other hand, requires students to recognize unstated assumptions or presuppositions that are taken for granted in given statements or assertions. Next, deduction (25 items) tests the ability to reason deductively from given statements or premises and to recognize the relation of implication between propositions. Fourth, interpretation measures the ability to weigh evidence and to distinguish between generalizations from given data that are not warranted beyond a reasonable doubt and generalizations which, although not absolutely certain or necessary, do seem to be warranted beyond a reasonable doubt. Lastly,
evaluation of arguments measures the ability to distinguish between arguments which are strong
and relevant and those which are weak or irrelevant to a particular question or issue. For the
standardization of the test, norm was set. With this, 4 grade levels were included which are
grades 9, 10, 11 and 12. There was a total of 20,312 students participated. High schools had to be
a regular public institution in a community of 10 000-75000 with a minimum of 100 students.
This was done to avoid the biasing influences associated with extremely small schools and with
specialized High school found in some very large systems. The reliability was determined using
the split-half method. The computed reliability coefficients were .61, .74, .53, .67, and .62 for inference, recognizing assumptions, deduction, interpretation, and evaluation of arguments, respectively, in the Ym form. For the Zm form, the reliability coefficients were .55, .54, .41, .52, and .40 for the same subtests, respectively. Validity was then determined through content and construct validity. The indication of content validity was the extent to which the critical thinking appraisal measures a sample of the specified objectives of such instructional programs. Moreover, for construct validity, intercorrelations among the various forms of the test ranged from .21 to .50, and correlations of .56 to .79 were computed between the subtests and the appraisal as a whole.

Achievement Tests

Metropolitan Achievement Test

This test was designed to provide accurate and dependable data concerning the achievement of students in important skills and content areas of the school curriculum. It is based on the theory that an achievement test should assess what is being taught in classrooms, and it has been extended to include the first half of kindergarten and grades 10-12. It is a two-component system of achievement evaluation designed to obtain both norm-referenced and criterion-referenced information. The first is the instructional component
which is designed for classroom teachers and curriculum specialists. This is an instructional
planning tool that provides prescriptive information on the educational performance of individual
students in terms of specific instructional objectives. There is a separate instructional battery
under this which includes reading, mathematics, and language all available in JI and KI forms.
The other one is the survey component which provides the classroom teacher with considerable
information about the strengths and weaknesses of the students in the class in the important skill and content areas of the school curriculum. Under this are 8 overlapping batteries covering
the age range from K-12. This also includes reading, mathematics, and language. The norm was
set, and participants were selected to represent the national population in terms of school system enrollment, public versus non-public school affiliation, geographic region, socio-economic status, and ethnic background. There were 550 students, with 10% from public schools in the metropolitan population and 10% from the national population; for socio-economic status, 54% were from the metropolitan population and 52% from the national population, all adults who had graduated from high school. Reliability was computed using KR-20 and obtained .93 for reading, .91 for mathematics, and .88 for language; the basic battery was .96. The standard error of measurement was also computed and yielded 2.8 for reading, 2.9 for mathematics, 3.4 for language, and 5.3 for the basic battery. Validity was determined through content validity, with the belief that the objectives and items should correspond to the school curriculum. With this in mind, a compendium of instructional objectives was made available.

Stanford Achievement Test

This test was designed by Gardner, Rudman, Karlson, & Merwin in 1981. It is a series of comprehensive tests developed to assess the outcomes of learning at different levels of the educational sequence. It measures the objectives of general education from kindergarten through the first year of college. Its series include the SESAT (Stanford Early School Achievement Test) and the TASK (Stanford Test of Academic Skills). The SAT is intended for primary,
intermediate, and junior high school. It assesses the essential learning outcomes of the school
curriculum. It was first established in 1923 and underwent several revisions until 1982. These revisions were done to keep a close match between test content and learning practices, to provide norms that accurately reflect the performance of students in different grade levels, and to achieve modern ways of interpreting the scores resulting from improvements in measurement technology. SESAT is for children in kindergarten and grade 1. This test measures
the cognitive development of children upon admission and entry into school in order to establish
a baseline where learning experiences may best begin. On the other hand, TASK was intended
for grade 8 to 13 students (first year of college). It intends to measure basic skills. Level I of the TASK is for grades 8-12 and measures the competencies and skills that are desired at the adult social level, while Level II is for grades 9-13 and measures the skills that are requisite to
continued academic training. The SAT contains subtests covering reading comprehension, vocabulary, listening comprehension, spelling, language, concepts of numbers, math computations, math applications, and science. Reading comprehension is the measure of understanding skills, including textual (typically found in books), functional (printed materials found in daily life), and recreational (reading for enjoyment, such as poetry and fiction) reading. Vocabulary measures the pupil's language competence without requiring the pupil to read before answering. Listening comprehension is the subtest which evaluates the ability of the student to process
information that has been heard. Spelling tests the ability of the student to identify the misspelled
words from a group of four words. Language test has three parts which are the proper use of
capital letters, use of punctuation marks, and appropriate use of the parts of speech. Concept of
number includes the student's understanding of the basic concepts about numbers. Math
computations include the multiplication and division of whole numbers, operations to fractions,
decimals and percents. Math application tests the student’s ability to apply the concepts they
have learned to problem solving. And lastly, science, measures the ability of the students to
understand the basic physical and biological sciences. One of the items in SAT under vocabulary
is "when you have a disease, you are ____" a. sick, b. rich, c. lazy, d. dirty. The reliability of the test was obtained through internal consistency using KR-20 (computed r = .85-.95), the standard error of measurement, and alternate-form reliability. For validity, the test content was compared
with the instructional objectives of the curriculum.

Aptitude Tests

Differential Aptitude Test

This test was designed to meet the needs of guidance counselors and consulting psychologists, whose advice and ideas were sought in planning a battery which would meet rigorous standards and be practical for daily use in schools, social agencies, and business organizations. The original forms (Forms A and B) were developed in 1947 with the aim to
provide an integrated scientific and well-standardized procedure for measuring the abilities of the
boys and girls in grade 8-12 for the purposes of educational and vocational guidance. It was
primarily for junior and senior high school. It can also be used in educational and vocational
counseling of young adult out of school and selection of employees. This test was revised and
restandardized in 1962 for the forms L and M and in 1972 for forms S and T. Included in the
battery of test for DAT are verbal reasoning, numerical ability, abstract reasoning, clerical speed
and accuracy, mechanical ability, space relations and spelling. The verbal reasoning measures
the ability of the student to understand concepts that were framed in words. Numerical ability
subtest tests the understanding of the students of numerical relationships and facility in handling
numerical concepts, including arithmetic computations. Abstract reasoning is intended as a non-verbal measure of the student's reasoning ability. Clerical speed and accuracy intends to measure the speed of response in simple perceptual tasks involving simple number and letter combinations. The mechanical ability test is a reconstructed version of the Mechanical Comprehension Test (but easier) and measures mechanical intelligence. Space relations measures the ability to deal with concrete materials through visualization. Lastly, spelling measures the student's ability to detect errors in grammar, punctuation, and capitalization. The
norm was obtained through percentiles and stanines. Seventy-six school districts were included to test grade 8-12 students, including schools in the District of Columbia. Schools with 300 or more students each were included. In small school districts, the entire grade 8-12 enrollment participated; for large school districts, representative samples were included, taking into consideration school achievement and racial composition. All in all, there were 14,049 8th grade students, 14,793 9th grade students, 13,613 10th grade students, 11,573 11th grade students, and 10,764 12th grade students. The reliability coefficients were computed through the split-half method. Validity was determined, and the coefficients presented demonstrate the utility of the Differential Aptitude Test for educational guidance. Each of the tests is potentially useful, as the expectancy tables of validity coefficients evidently show.

Flanagan Industrial Test

This test is a set of 18 short tests designed for use with adults in personnel selection
programs for a wide variety of jobs. The tests are short and self-administering. The FIT battery
measures 18 subscales including arithmetic, assembly, components, coordination, electronics,
expression, ingenuity, inspection, judgment and comprehension, mathematics and reasoning,
mechanics, memory, patterns, planning, precision, scales, tables, and vocabulary. Arithmetic
measures the accuracy in working with numbers. Assembly measures the ability to visualize the
appearance of an object assembled from separate parts. Component is the ability to locate and
identify important parts of a whole. Coordination tests arm and hand coordination. Electronics measures the understanding of electronic principles and the ability to analyze diagrams of electrical circuits. Expression is the feel for and knowledge of correct English and the ability to convey ideas in writing and talking. Ingenuity refers to being creative and inventive and having the ability to devise procedures, equipment, and presentations. Inspection is the ability to spot flaws and imperfections in a series of articles accurately and quickly. Judgment and comprehension is the ability to read with understanding and use good judgment in interpreting materials. Math and reasoning refers to the understanding of basic math concepts and the ability to apply them in solving certain problems. Mechanics is the ability to understand mechanical principles
and analyze mechanical movements. Memory tests the learning and recalling ability in terms of
association. Patterns refer to the ability to perceive and reproduce simple pattern outlines
accurately. Planning is the ability to foresee problems that may arise and anticipate the best order
for carrying out steps. Precision refers to the ability to make appropriate finger movements with accuracy. Scales is the ability to read and understand what scales, graphs, and charts are conveying. Tables refers to the ability to read and understand tables accurately and quickly. Vocabulary refers to the ability to choose the right terms to convey one's ideas. The standardization sample for this test consisted of 12th grade students. The reliability of the test was determined, with reported reliability coefficients ranging from .50 to .90 for the individual tests. When the FIT was correlated with the FACT, the range was .28 (memory) to .79 (arithmetic). For the validity of the test, many of the short tests have fairly substantial validity coefficients, ranging from .20 to .50, using stepwise multiple regression. It was also found that five of the tests, namely Math and Reasoning, Judgment and Comprehension, Planning, Arithmetic, and Expression, yield a multiple correlation of .5898 with fall semester GPA. The first four tests, along with Vocabulary and Precision, provide a multiple correlation of .47 with spring semester GPA. In general, the multiple correlations vary from .40 to .57.

Personality Test

Edwards Personal Preference Schedule

This test was created by Allen L. Edwards and was published by The Psychological
Corporation. This test is an instrument for research and counseling purposes. It can provide
convenient measures of independent personality variables. Moreover, it provides measures of test consistency and profile stability. It is a non-projective personality test derived from H. A. Murray's theory, which rates individuals on fifteen normal needs or motives.
These needs or motives from Murray’s theory are the statements used in the Edwards Personal
Preference Schedule. It consists of 15 scales including achievement, deference, order, exhibition,
autonomy, affiliation, intraception, succorance, dominance, abasement, nurturance, change,
endurance, heterosexuality, and aggression. Achievement is described as the desire of the person
to exert best effort. Deference is the tendency of the person to get suggestions from other people,
doing what is expected praising others conforming and accept other’s leadership. Order is the
neatness and organization in doing one’s work, arranging everything in proper order so
everything will run smoothly. Exhibition is the tendency of saying smart and clever things to
gain other’s praise and be the center of attention. Autonomy is the ability to do whatever desired,
avoiding conformity and making independent decisions. Affiliation is having a lot of friends,
ability to form new acquaintance, and build intimate attachments with others. Intraception is the
tendency to put oneself on other’s shoes, and analyzing other’s behaviors and motives.
Succorance is the desire to be helped by others in times of trouble; the person seeks encouragement and
wants others to be sympathetic to him. Dominance is the tendency of the person to argue with
another’s view, act as a leader in the group thereby influencing others and make group decisions.
Abasement is the tendency to feel guilty when one commits a mistake, to accept blame, and to
feel the need to confess after a mistake is made. Nurturance is the ability to help friends who are
in trouble, the desire to help the less fortunate, showing great affection to others, and being kind
and sympathetic. Change is the tendency to explore new things and do things outside of routine.
Endurance is the ability to keep working on a task until it is finished and to stick to a problem
until it is solved. Heterosexuality is the desire to go out with friends of the opposite sex, to be
physically attracted to people of the opposite sex, and to be sexually excited. Lastly, aggression
is the tendency to criticize others in public, attack contrary points of view, and make fun of
others. This test is intended for college students and adults. To set the norms, 1,509 college
students were included; the norm group included high school graduates and those with college
training, and consisted of 749 college females and 760 college males. Part of the normative
sample were also adults, consisting of male and female household heads who were members of a
consumer purchase panel used for market surveys. They came from rural and urban areas across
the 48 states, and the consumer panel consisted of 5,105 households. For reliability, a split-half
technique was used. The coefficients of internal consistency for the 1,509 students in the college
normative group range from .60 to .87 with a median of .78. Test-retest stability coefficients
with a one-week interval were also computed; these are based on a sample of 89 students and
range from .55 to .87 with a median of .73. Other researchers have reported similar results over a
three-week period, showing correlations of .55 to .87 with a median of .73. For validity, the
manual reports studies comparing the EPPS with the Guilford-Martin Personality Inventory and
the Taylor Manifest Anxiety Scale. Other researchers have correlated the California
Psychological Inventory, the Adjective Check List, the Thematic Apperception Test, the Strong
Vocational Interest Blank, and the MMPI with the EPPS. In these studies there are often
statistically significant correlations among the scales of these tests and the EPPS, but the
relationships are usually low to moderate and often difficult for the researcher to explain.
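To make the split-half procedure mentioned above concrete, the sketch below estimates a split-half reliability from an odd-even split corrected with the Spearman-Brown formula. The item responses are simulated and purely illustrative; they are not EPPS data.

import numpy as np

# Hypothetical dichotomously scored responses: rows = examinees, columns = items
rng = np.random.default_rng(1)
items = (rng.random((200, 30)) > 0.5).astype(int)

# Split the items into odd and even halves and score each half
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# Correlation between the two half-test scores
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction estimates the reliability of the full-length test
split_half_reliability = 2 * r_half / (1 + r_half)
print(round(split_half_reliability, 2))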

Guilford-Zimmerman Temperamental Survey

This inventory was developed for organizational psychologists, personnel professionals,
clinical psychologists, and counseling professionals in mental health facilities, businesses, and
educational settings. It was developed with the aim of measuring attributes related to personality
and temperament that might help predict successful performance in various occupations,
identifying students who may have trouble adjusting to school and the types of problems that
may occur, assessing temperamental trends that may be the source of problems and conflicts in
marriage or other relationships, and providing objective personality information to complement
other data that may assist with personnel selection, placement, and development. This test
provides a nonclinical description of an individual's personality characteristics that can be used
in career planning, counseling, and research. Its subscales include General Activity (G), Restraint
(R), Ascendance (A), Sociability (S), Emotional Stability (E), Objectivity (O), Friendliness (F),
Thoughtfulness (T), Personal Relations (P), and Masculinity (M). A high score in General
Activity means having strong drive and energy. A high score in Restraint may mean not being
happy-go-lucky, carefree, or impulsive. A high score in Ascendance suggests a tendency to take
charge and even ride roughshod over others, which is typical of the work of foremen and
supervisors. A high score in Sociability means optimism and cheerfulness. A high score in
Objectivity may mean being less egoistic and less hypersensitive. A high score in Friendliness
means a lack of fighting tendencies and a desire to be liked by others. A high score in
Thoughtfulness may pertain to men who have an advantage in obtaining supervisory positions.
High scores in Personal Relations mean a high capability of getting along with other people. A
high score in Masculinity may pertain to people who behave in ways that are more acceptable for
men. Examples of items in this test are "You like to play practical jokes on others" and "Most
people are out to get more than they give." Standardization of this test was done by gathering
523 college men and 389 college women from one southern California university and two junior
colleges for all traits except Thoughtfulness. The male sample included veterans aged 18 to 30
years old. Reliability was calculated using KR-20, with coefficients ranging from .79 for General
Activity to .87 for Sociability. The intercorrelations among the ten traits are gratifyingly low;
only two are high, namely between Sociability and Ascendance and between Emotional Stability
and Objectivity. For validity, it is believed that what each score measures is fairly well defined
and that the scores represent confirmed dimensions of personality and dependable descriptive
categories. The most impressive validity data have come from the use of the inventory with
supervisory and administrative personnel.
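The KR-20 coefficient cited above is computed from the item difficulties and the total-score variance. The sketch below shows the formula applied to simulated right/wrong data; it is an illustration of the statistic only, not a reanalysis of the Guilford-Zimmerman norms.

import numpy as np

# Hypothetical dichotomously scored responses (1/0): rows = examinees, columns = items
rng = np.random.default_rng(2)
responses = (rng.random((150, 25)) > 0.4).astype(int)

k = responses.shape[1]                      # number of items
p = responses.mean(axis=0)                  # proportion keyed correct for each item
q = 1 - p
total_variance = responses.sum(axis=1).var(ddof=1)

# Kuder-Richardson formula 20
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_variance)
print(round(kr20, 2))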

IPAT 16 Personality Factors

The 16 PF was originally developed by Raymond Cattell, Karen Cattell, and Heather Cattell to
help identify personality factors. It can be administered to individuals 16 years and older. There
are 16 bipolar dimensions of personality and 5 global factors. The bipolar dimensions of
Personality are Warmth (Reserved vs. Warm; Factor A), Reasoning (Concrete vs. Abstract;
Factor B), Emotional Stability (Reactive vs. Emotionally Stable; Factor C), Dominance
(Deferential vs. Dominant; Factor E), Liveliness (Serious vs. Lively; Factor F)
Rule-Consciousness (Expedient vs. Rule-Conscious; Factor G), Social Boldness (Shy vs.
Socially Bold; Factor H), Sensitivity (Utilitarian vs. Sensitive; Factor I), Vigilance (Trusting vs.
Vigilant; Factor L), Abstractedness (Grounded vs. Abstracted; Factor M), Privateness (Forthright
vs. Private; Factor N), Apprehension (Self-Assured vs. Apprehensive; Factor O), Openness to
Change (Traditional vs. Open to Change; Factor Q1), Self-Reliance (Group-Oriented vs. Self-
Reliant; Factor Q2), Perfectionism (Tolerates Disorder vs. Perfectionistic; Factor Q3), Tension
(Relaxed vs. Tense; Factor Q4). The global factors are Extraversion, Anxiety, Tough-
mindedness, Independence, and Self-Control. A stratified random sampling that reflects the 2000
U.S. Census was used to create the normative sample, which consisted of 10,261 adults. Test-
retest coefficients offer evidence of the stability over time of the different traits measured by the
16 PF. Pearson-Product Moment correlations were calculated for two-week and two-month test-
retest intervals. Reliability coefficients for the primary factors ranged from .69 (Reasoning,
Factor B) to .86 (Self-Reliance, Factor Q2), with a mean of .80. Test-retest coefficients for the
global factors were higher, ranging from .84 to .90 with a mean of .87. Cronbach's alpha values
ranged from .64 (Openness to Change, Factor Q1) to .85 (Social Boldness, Factor H), with an
average of .74. Validity studies of the 16 PF (5th ed.) demonstrated its ability to predict various
criterion measures such as the Coopersmith Self-Esteem Inventory, Bell's Adjustment Inventory,
and the Social Skills Inventory. Its subscales also correlate well with the factors of the Myers-
Briggs Type Indicator.
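Cronbach's alpha, reported above for the 16 PF primary factors, generalizes KR-20 to items that are not scored dichotomously. A minimal sketch on simulated Likert-type data follows; the values are hypothetical and do not reproduce the published coefficients.

import numpy as np

# Hypothetical 5-point item scores: rows = examinees, columns = items of one scale
rng = np.random.default_rng(3)
scores = rng.integers(1, 6, size=(120, 10)).astype(float)

k = scores.shape[1]
item_variances = scores.var(axis=0, ddof=1)
total_variance = scores.sum(axis=1).var(ddof=1)

# Cronbach's alpha: internal consistency from item and total-score variances
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(round(alpha, 2))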

Myers-Briggs Type Indicator


The Myers-Briggs Type Indicator (MBTI) assessment is a psychometric questionnaire
designed to measure psychological preferences in how people perceive the world and make
decisions. These preferences were extrapolated from the typological theories originated by Carl
Gustav Jung, as published in his 1921 book Psychological Types. The original developers of the
personality inventory were Katharine Cook Briggs and her daughter, Isabel Briggs Myers. This
test is suited for individuals 14 years old and above and requires a 7th grade reading level. It has
8 factors: Extraversion, Sensing, Thinking, Judging, Introversion, Intuition, Feeling, and
Perceiving. People with a preference
for Extraversion draw energy from action: they tend to act, then reflect, then act further. If they
are inactive, their level of energy and motivation tends to decline. Conversely, those whose
preference is Introversion become less energized as they act: they prefer to reflect, then act, then
reflect again. People with Introversion preferences need time out to reflect in order to rebuild
energy. Sensing and intuition are the information-gathering (Perceiving) functions. They
describe how new information is understood and interpreted. Individuals who prefer Sensing are
more likely to trust information that is in the present, tangible and concrete: that is, information
that can be understood by the five senses. They tend to distrust hunches that seem to come out of
nowhere. They prefer to look for details and facts. For them, the meaning is in the data. On the
other hand, those who prefer intuition tend to trust information that is more abstract or
theoretical, that can be associated with other information (either remembered or discovered by
seeking a wider context or pattern). Thinking and Feeling are the decision-making (Judging)
functions. The Thinking and Feeling functions are both used to make rational decisions, based on
the data received from their information-gathering functions (Sensing or intuition). Those who
prefer Thinking tend to decide things from a more detached standpoint, measuring the decision
by what seems reasonable, logical, causal, consistent and matching a given set of rules. Those
who prefer Feeling tend to come to decisions by associating or empathizing with the situation,
looking at it 'from the inside' and weighing the situation to achieve, on balance, the greatest
harmony, consensus and fit, considering the needs of the people involved. Myers and Briggs
taught that types with a preference for Judging show the world their preferred Judging function
(Thinking or Feeling). So TJ types tend to appear to the world as logical, and FJ types as
empathetic. According to Myers, Judging types prefer to "have matters settled." Those types
ending in P show the world their preferred Perceiving function (Sensing or intuition). So SP
types tend to appear to the world as concrete and NP types as abstract. According to Myers,
Perceiving types prefer to "keep decisions open." With regard to validity, it has been noted that
unlike other personality measures, such as the Minnesota Multiphasic Personality Inventory or
the Personality Assessment Inventory, the MBTI lacks validity scales to assess response styles
such as exaggeration or impression management. The MBTI has not been
validated by double-blind tests, in which participants accept reports written for other participants,
and are asked whether or not the report suits them, and thus may not qualify as a scientific
assessment. With regard to factor analysis, one study of 1,291 college-aged students found six
different factors instead of the four used in the MBTI. In other studies, researchers found that the
JP and the SN scales correlate with one another. For reliability, some researchers have
interpreted the reliability of the test as being low, with test takers who retake the test often being
assigned a different type. According to some studies, 39–76% of those tested fall into different
types upon retesting some weeks or years later. About 50% of people tested within nine months
remain the same overall type and 36% remain the same after nine months. When people are
asked to compare their preferred type to that assigned by the MBTI, only half of people pick the
same profile. Critics also argue that the MBTI lacks falsifiability, which can cause confirmation
bias in the interpretation of results. The standardization was made using high school, college, and
graduate students; recently employed college graduates; and public school teachers.

Panukat ng Ugali at Pagkatao

This test, also called the PUP, was developed by Virgilio G. Enriquez and Ma. Angeles
Guanzon-Lapena. It was published by the Research Training House in 1975. The Panukat ng
Ugali at Pagkatao is a psychological test that can be used for research, employment, and
screening of members and students in an institution. Its reliability is .90, and the test-retest
reliability was .94 (p < .01). Its trait dimensions, each with underlying personality traits, are
Extraversion or Surgency, Agreeableness, Conscientiousness, Emotional Stability, and Intellect
or Openness to Experience. Under extraversion are ambition (+), guts/daring (+), shyness or
timidity (-), and conformity (-). Ambition is the tendency of a person to act toward the
accomplishment of his/her goals. Guts/daring is courage, a very strong emotion from within the
person; it can relate to things that are at risk or in danger, whether life itself, aspects of life, or
material things. Shyness or timidity is the trait of being timid, reserved, and unassertive. A
person who is shy tends not to socialize with others, avoids eye contact, and lacks trust in
himself/herself, and so prefers to be alone. Conformity is the tendency of a person to take into
consideration what other people are saying, especially if that person holds a higher position. A
conforming person tends to disregard his/her own opinion. For agreeableness, the factors are
respectfulness (+), generosity (+), humility (+), helpfulness (+), difficulty to deal with (-),
criticalness (+), and belligerence (-). Respectfulness is the trait of giving value to the person you
are talking to regardless of his/her position and age. Generosity is the ability to satisfy the needs
of others by giving what they need or want even if it is not in accordance with one's personal
desire. Humility is the trait of showing modesty and humbleness in dealing with other people and
not boasting of one's accomplishments and status in life. Helpfulness is the desire to attend to
others' needs and fill their shortcomings. Difficulty to deal with is the tendency of a person to
agree to something only after many requests. Criticalness is the tendency of a person to criticize
every small detail of something, giving attention to things that are rarely noticed by others.
Belligerence is the trait of being quarrelsome and hot headed, easily angered, and frequently
getting into trouble due to little or no patience. For the conscientiousness dimension, the traits
are thriftiness, perseverance, responsibleness, prudence, fickle mindedness, and stubbornness.
Thriftiness is the ability of a person to manage his/her resources wisely and to be conservative in
spending money. Perseverance is the persistence of a person in achieving one's goals and staying
with things already started until they are finished. Responsibleness is the capacity to do the tasks
assigned and to be accountable for them. Prudence is the ability to make sound and careful
decisions by weighing the available options. Fickle mindedness is the tendency of a person to
think twice before finally making up one's mind and to have a changing mind once in a while.
Finally, stubbornness is the determination to do things despite any prohibitions, hindrances, and
objections, and being hard to convince that one has committed a mistake. For the fourth
dimension, which is emotional stability, four traits are included: restraint (+/-), sensitiveness (-),
low tolerance to joking/teasing (-), and moodiness (-). Restraint is the tendency of a person not to
show his/her intense emotions, keeping one's own feelings in check as a self-control strategy.
Sensitiveness is the tendency of a person to be easily hurt or affected by little things said or done
that the person does not like. Low tolerance to joking/teasing is the tendency of a person to react
with intense emotion to teasing or provocation from other people.
Moodiness is the tendency to show unusual attitudes or behaviors and changing emotions due to
an unexpected event. The last dimension of this test, intellect or openness to experience, includes
three personality traits: thoughtfulness, creativity, and inquisitiveness. Thoughtfulness is the
tendency of a person to be very concerned with the future, especially regarding problems or
troubles. Creativity is the natural ability of a person to make or create something out of local
materials or resources, the ability to express oneself, a wide imagination, and a high inclination
toward music, arts, and culture. Last, inquisitiveness is the trait of being curious and sometimes
intrusive. To establish a norm, 3,702 respondents from different ethnic groups were asked to
participate: 412 Bicolano, 152 Chabacano, 642 Ilocano, 489 Cebuano, 170 Ilonggo, 190
Kapampangan, 513 Tagalog, 378 Waray, 29 Zambal, and 83 others. For the validity of the test,
all items are said to have a positive direction. Two validity subscales were used: denial (items
with which respondents are certain to disagree, such as "I never told a lie in my entire life") and
tradition (items with which respondents are certain to agree, such as "I would take care of my
parents when I get old").

Panukat ng Pagkataong Pilipino

This test was developed by Anadaisy J. Carlota of the Psychology Department of the
University of the Philippines. It was published in Quezon City in 1989. The PPP is a three-form
personality test designed to measure 19 personality dimensions. Each personality dimension
corresponds to a subscale composed of a homogeneous subset of items. The three forms are
Form K, Form S, and Form KS. Form K corresponds to the salient traits for interpersonal
relations. Under this form are eight personality traits: thoughtfulness, social curiosity,
respectfulness, sensitiveness, obedience, helpfulness, capacity to be understanding, and
sociability. Thoughtfulness is the tendency to be considerate of others; a thoughtful person tries
not to inconvenience other people. Social curiosity is inquisitiveness about others' lives; a
socially curious person tends to ask someone about everything and loves to know everything that
is happening around him/her. Respectfulness is the tendency to recognize others' beliefs and
privacy; a respectful person's behavior is concretized, for example, by knocking on the door
before entering. Sensitiveness is the tendency of a person to be easily affected by any negative
criticism, so a sensitive person does not want to hear negative criticisms from other people.
Obedience is the tendency of a person to do what others demand of him/her; an obedient person
tends to follow whatever others command. Helpfulness is the tendency of a person to offer
service to others, extend help, and give resources; it is characterized by a person who is always
willing to lend his/her things to others. The capacity to be understanding is the person's tolerance
of other people's shortcomings; when hurt by others, he/she is always ready to listen to
explanations. Lastly, sociability is the ability of a person to easily get along with and befriend
others; in social gatherings or events this person always takes the first move to introduce
himself/herself to others. The second form of this test is Form S, which includes seven factors:
orderliness, emotional stability, humility, cheerfulness, honesty, patience, and responsibility.
Orderliness is neatness and organization in one's appearance and even in one's work; the person
who is orderly puts his/her things in proper places. Emotional stability is the ability of a person
to control his/her emotions and to remain calm even when faced with great trouble. Humility is
the tendency to remain modest despite accomplishments and to readily accept one's own
mistakes; a humble person does not boast about his/her successes.
Cheerfulness is the disposition of a person to be cheerful and to see the happy and funny aspects
of things that happen; a cheerful person is one who always finds the funny side of situations.
Honesty is the sincerity and truthfulness of a person; an honest person tends to tell the truth in
every situation regardless of the feelings of others. Patience is the ability to cope with daily life's
routine and repetitive activities; a patient person is one who responds to a child's repetitive
questions without getting mad. Lastly, responsibility is the tendency of a person to do a particular
task on his/her own initiative; a responsible person is characterized by not procrastinating in
accomplishing an activity. For the last form of the PPP, Form KS, there are four subscales:
creativity, risk-taking, achievement orientation, and intelligence. Creativity is the ability to be
innovative and to think of various strategies in solving a problem. Risk-taking is the tendency to
take on new challenges despite unknown consequences; a risk-taker is one who believes that one
must take risks to be successful in life. Achievement orientation is the tendency of a person to
strive for excellence and to emphasize quality over quantity in every task he/she does. Lastly,
intelligence is the trait of perceiving oneself as an intelligent person; it is also characterized by
easily understanding the material being read. This test can be taken by persons aged 13 and
above. It is written in Filipino and has translations in English, Cebuano, Ilocano, and Ilonggo.
During its pretest, 245 respondents aged 13 to 81 years old were included, with more females
than males. Reliability was tested through internal consistency. All personality dimensions
except achievement orientation have high reliability. The internal consistency analysis was done
several times: at first the top 10 items were retained, then the top 12, then the top 14; in the
fourth round, the top 8 were taken and included in the inventory. Form K has a mean reliability
coefficient of .69, Form S .81, and Form KS .72. For validity, construct validation was applied:
the internal structure of the original version of the PPP was examined before clustering into the
three forms, and intercorrelations among the subscales were obtained. The test was considered
valid because, for one, more positive than negative intercorrelations were obtained. Second,
among the personality subscales there were also more positive than negative correlations, except
for social curiosity and sensitiveness. Lastly, the magnitudes of the correlations were small to
moderate, although the majority of the subscales were significant at the .05 alpha level. The
predominance of positive intercorrelations means that all of the subscales are measuring the
same construct, which is personality. The test was standardized through norms developed in two
forms: percentiles and normalized standard scores with a mean of 50 and a standard deviation of
10.
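The normalized standard scores mentioned above (mean of 50, standard deviation of 10) can be produced by converting each raw score to a percentile rank in the norm group and then mapping that rank through the inverse normal distribution. The sketch below uses simulated raw scores, not the PPP norm data.

import numpy as np
from scipy import stats

# Hypothetical raw scores from a norming sample
rng = np.random.default_rng(4)
raw_scores = rng.integers(10, 61, size=500)

# Percentile rank of each raw score within the norm group (midpoint convention)
ranks = stats.rankdata(raw_scores, method="average")
percentile_ranks = (ranks - 0.5) / len(raw_scores) * 100

# Normalized standard scores: percentile ranks -> z-scores -> mean 50, SD 10
z_scores = stats.norm.ppf(percentile_ranks / 100)
normalized_scores = 50 + 10 * z_scores
print(normalized_scores[:5].round(1))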

Attitude Tests

Survey of Study Habits and Attitudes

This test was developed to help address a recurring puzzle: some students with high
scholastic aptitude were doing very poorly in school, while some who were mediocre on
scholastic tests were doing well. The test is an easily administered survey of study methods,
motivation for studying, and certain attitudes toward scholastic activities which are important in
the classroom. The purposes of developing it were to identify students whose study habits and
attitudes are different from those of students who earn high grades, to aid understanding of
students with academic difficulties, and to provide a basis for helping such students improve
their study habits and attitudes and thus more fully realize their best potentialities. In addition,
study habits are believed to be a strong predictor of achievement. This test consists of Form C
for college and Form H for high school
(grades 7-12). The four basic subscales are delay avoidance, work methods, teacher approval,
and educational acceptance. It has 100 items and can be used as a screening instrument, a
diagnostic tool, and a teaching aid. There were separate norms for the two forms. For Form C,
3,054 first semester freshmen enrolled at nine colleges were included: Antioch College, Bowling
Green State University, Colorado College, Reed College, San Francisco State College,
Southwest Texas State College, Stephen F. Austin State College, Swarthmore College, and
Texas Lutheran College. For Form H, 11,218 students in 16 different towns and metropolitan
areas in America participated: Atacosta, Texas (10-12); Austin, Texas (10-12); Buda, Texas
(7-12); Durango, Colorado (10-12); Glen Ellyn, Illinois (9); Gunnison, Colorado (10-12);
Hagerstown, Maryland (7-12); Marion, Texas (7-12); Navarro, Texas (7-12); New Braunfels,
Texas (7-9); Salt Lake City, Utah (7-12); San Marcos, Texas (7-12); Seguin, Texas (7-12); St.
Louis, Missouri (7-12); and Waelder, Texas (7-12). The computed reliability coefficients were
based on Kuder-Richardson formula 8 and ranged from .87 to .89. Using the test-retest method
over a four-week interval, the coefficients were .93, .91, .88, and .90 for delay avoidance, work
methods, teacher approval, and educational acceptance, respectively. Over a 14-week interval,
the reliability coefficients were .88, .86, .83, and .85. For validity, the criterion used was the
one-semester grade point average (GPA). The SSHA and GPA were correlated, and the
coefficients ranged from .27 to .66 for men and .26 to .65 for women. The average validity
coefficients for ten colleges were .42 for men and .45 for women. The correlation of the SSHA
with the ACE (American Council on Education Psychological Examination), a scholastic
aptitude test, was always low. SSHA Form C correlated with GPA from .25 to .45, with a
weighted average of .36. Averaged using Fisher's z-transformation, the correlations of each
subscale with GPA were .31 for delay avoidance, .32 for work methods, .25 for teacher approval,
and .35 for educational acceptance.
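The averaging of validity coefficients with Fisher's z-transformation, mentioned above, is shown in the minimal sketch below. The correlations are placeholders, not the actual SSHA coefficients.

import numpy as np

# Hypothetical validity coefficients (correlations with GPA) from several colleges
correlations = np.array([0.27, 0.35, 0.42, 0.31])

z_values = np.arctanh(correlations)   # Fisher's r-to-z transformation
mean_z = z_values.mean()              # average in the z metric
average_r = np.tanh(mean_z)           # transform back to the correlation metric
print(round(average_r, 2))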

Work Values Inventory

This test intends to meet the need for assessing the goals which motivate people to work. It
measures values which are extrinsic to work as well as those which are intrinsic in it, the
satisfactions which men and women seek in work, and the satisfactions which may be the
concomitants or outcomes of work. It seeks to measure these in boys and girls, and in men and
women, at all age levels beginning with adolescence and at all educational levels beginning with
entry into junior high school. It is, both in the variety of values tapped and in the ages for which
it is appropriate, a wide-range values inventory. Its factors are altruism, esthetics, creativity,
intellectual stimulation, achievement, independence, prestige, management, economic returns,
security, surroundings, supervisory relations, associates, way of life, and variety. Altruism refers
to work which enables the person to contribute to the welfare of others. Esthetics refers to work
which permits one to make beautiful things and to contribute beauty to the world. Creativity
pertains to work which permits one to invent new things, design new products, or develop new
ideas. Intellectual stimulation refers to work which provides opportunity for independent
thinking and for learning how and why things work. Achievement refers to work which gives
one a feeling of accomplishment in a job well done. Independence pertains to work which
permits one to work in his own way, as fast or as slowly as he wishes. Prestige pertains to work
which gives one standing in the eyes of others and evokes respect. Management refers to work
which permits one to plan and lay out work for others to do. Economic returns pertain to work
which pays well and enables one to have the things he wants. Security pertains to work which
provides one with certainty of having a job even in hard times. Surroundings pertains to work
which is carried out under pleasant conditions, neither too hot nor too cold, noisy, or dirty.
Supervisory relations refer to work which is carried out under a supervisor who is fair and with
whom one can get along. Associates refer to work which brings one into contact with fellow
workers whom he likes. Way of life refers to the kind of work that permits one to live the kind of
life he chooses and to be the type of person he wishes to be. Variety refers to work that provides
an opportunity to do different types of jobs. One of the items in the inventory under creativity is
"Create new ideas, programs or structures departing from those ideas already in existence." To
set the standards of this test, norms were obtained. The sample consisted of students in grade 7
(902 females, 925 males), grade 8 (862 females, 949 males), grade 9 (844 females, 931 males),
grade 10 (772 females, 859 males), grade 11 (824 females, 814 males), and grade 12 (724
females and 672 males). Reliability was obtained through the test-retest method, and the
reliability coefficients reported for the subscales were .83, .82, .84, .81, .83, .83, .76, .84, .88,
.87, .74, .82, and .80. Validity was also examined through construct, content, concurrent, and
predictive validity. Some of the construct validity evidence was obtained by correlating the
Altruism subscale with the Social Service Scale (r = .67) and with the Social scale of the AVL
(r = .29), and the Esthetics subscale with the Artist key of the SVIB (r = .55), with the Artistic
scale of the Kuder (r = .48), and with the Aesthetic scale of the AVL (r = .08).

Interest Test

Brainard Occupational Preference Inventory

This test was designed to enable a systematic study of a person's interests. It is a
standardized questionnaire designed to bring to the fore the facts about a person with respect to
his occupational interests so that he and his advisers can more intelligently and objectively
discuss his educational and occupational plans. This test is intended for students in grades 8 to
12 and for adults. It requires relatively low reading skills as determined by readability formulas.
It provides information concerning a vital phase in the complex matter of setting a person's
vocational plans wisely and planning a program for attaining his goals. It yields scores in six
broad occupational fields for each sex. Both females and males obtain scores in fields identified
as commercial, mechanical, professional, esthetic, and scientific. The agricultural score is only
for boys and the personal service score is only for girls. Each field has 20 questions divided
among four occupational sections. A 5-point scale is used, from strongly dislike to strongly like.
The norm sample included 10,000 students in 14 school systems, both males and females, from
grades 8 to 12. Reliability was obtained through the test-retest method: boys obtained
coefficients of .73 for commercial and .88 for scientific scores, while girls obtained .71 for
commercial and .84 for esthetic. Another reliability method used was split-half: boys obtained
.88 for commercial scores and .95 for mechanical and scientific scores, while girls obtained .82
for commercial scores and .95 for scientific scores. For validity, the Brainard was correlated with
the Kuder Preference Record, and it was found that the latter test measures interest differently in
that it forces respondents to choose among three activities indicative of different types of
interest.

Activity

A. Look for other standardized tests and report their current validity and reliability.
B. Administer the test that you created in Lesson 2 of Chapter 5 to a large sample. Then create a
norm (a sketch of one way to build a simple norm table is shown below).
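For Activity B, one possible way to build a simple percentile-rank norm table from the tryout data is sketched below. The raw scores are simulated stand-ins for the scores you would actually gather.

import numpy as np

# Hypothetical raw scores gathered from a large tryout sample
rng = np.random.default_rng(5)
raw_scores = rng.integers(0, 51, size=1000)

# Percentile-rank norm table: for each possible raw score, the percentage of
# the sample scoring below it plus half of those scoring exactly at it
n = len(raw_scores)
norm_table = {}
for score in range(0, 51):
    below = (raw_scores < score).sum()
    at = (raw_scores == score).sum()
    norm_table[score] = round((below + 0.5 * at) / n * 100, 1)

# A new examinee's raw score can now be interpreted against the norm group
print(norm_table[35])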

References

(1973). Measuring intelligence with the culture fair test: Manual for scales 2 and 3. Institute of
Personality and Ability Testing, Philippines

Bennett, G.K., Seashore, H.G., & Wesman, A.G. (1973). 5th edition manual for the differential
aptitude test forms s and t. The Psychological Corporation, New York

Brainard, P.P. & Brainard, R.T. (1991). Brainard occupational preference inventory manual.
Bird Avenue, San Jose California USA

Briggs, K.C., & Myers, I.B. (1943). The Myers-Briggs Type Indicator Manual. Consulting
Psychologists Press, Inc.

Brown, W.F., & Holtzman, W.H. (1967). Survey of study habits and attitudes: SSHA manual.
The Psychological Corp, East 45th Street New York

Carlota, A. (1989). Panukat ng pagkataong pilipino PPP Manual. Quezon City Philippines.

Edwards, A.L. (1959). Edwards personal preference schedule manual. Psychological
Corporation, New York

Enriquez, V. G. & Guanzon, M.A. (1975). Panukat ng ugali at pagkatao manual. PPRTH-ASP
Panukat na Sikolohikal

Flanagan, J.C. (1965). Flanagan industrial test manual. Science Research Associates, East Street
Chicago Illinois

Gardner, E.F., Rudman, H.C., Karlson, B., & Merwin, J.C. (1981). Manual directions for
administering stanford achievement test. Harcourt Brace and Jovanovich, Inc., New York

Guilford, J.P.,& Zimmerman, W.S. (1949). Guilford zimmerman temperament survey: Manual
of instructions and interpretations. Harcourt Brace and Jovanovich, Inc., New York

Otis, A.S. & Lennon, R.T. (1957). Otis-Lennon mental ability test manual for administration.
Harcourt Brace and Jovanovich, Inc., New York

Otis, A.S. & Lennon, R.T. (1979). Otis-Lennon mental ability test manual for administration
and interpretation. Harcourt Brace and Jovanovich, Inc., New York

Prescott, G.A., Balow, I.H., Hogan, T.P. & Farr, R.C. (1978). Advanced 2: Metropolitan
achievement tests: Forms JS and KS. Harcourt Brace and Jovanovich, Inc., New York

Raven, J., Raven, J.C., & Court, J.H. (2003). Manual for raven's progressive matrices and
vocabulary scales. section 1: General overview. San Antonio, TX: Harcourt Assessment.

Super, D.E. (1970). Manual: Work values inventory. Houghton Mifflin Company

Thurstone, L.L. & Thurstone, T.G. (1967). SRA verbal examiner's manual. Science Research
Associates, East Street Chicago Illinois

Watson, G., & Glaser, E.M. (1964). Watson-Glaser critical thinking appraisal: Manual for
forms Ym and Zm. Harcourt Brace and Jovanovich, Inc., New York

Chapter 9

The Status of Educational Assessment in the Philippines

Objectives

1. Realize the strong foundation of the field of educational assessment in the Philippines.
2. Describe the history of formal assessment in the Philippines.
3. Describe the pattern of assessment practices in the Philippines.

Lessons

1 Assessment in the Early Years


2 Assessment in the Contemporary Period and Future Directions

Lesson 1

Assessment in the Early Years

Monroe Survey (1925)

Formal assessment in the Philippines started as a mandate from the government to look
into the educational status of the country (Elevazo, 1968). The first assessment was conducted
through a survey authorized by the Philippine legislature in 1925. The legislature created the
Board of Educational Survey, headed by Paul Monroe. Later, the board appointed an Educational
Survey Commission, which was also headed by Paul Monroe. The commission visited different
schools around the Philippines and observed the activities conducted in them. The survey
reported the following results:

1. The public school system that is highly centralized in administration needs to be humanized
and made less mechanical.
2. Textbook and materials need to be adapted to Philippine life.
3. The secondary education did not prepare for life and recommended training in agriculture,
commerce, and industry.
4. The standards of the University of the Philippines were high and should be maintained by
freeing the university from political interference.
5. Higher education should be concentrated in Manila.
6. English as the medium of instruction was best. The use of the local dialect in teaching
character education was suggested.
7. Almost all teachers (95%) were not professionally trained for teaching.
8. Private schools, except those under religious groups, were found to be unsatisfactory.

Research, Evaluation, and Guidance Division of the Bureau of Public Schools

This division started as the Measurement and Research Division in 1924, an offshoot of
the Monroe Survey. It was intended to be the major agent of research in the Philippines.
Its functions were:

1. To coordinate the work of teachers and supervisors in carrying out testing and research
programs
2. To conduct educational surveys
3. To construct and standardize achievement tests

Economic Survey Committee

Through a legislative mandate in 1927, the director of education created the Economic
Survey Committee headed by Gilbert Perez of the Bureau of Education. The survey studied the
economic condition of the Philippines and made recommendations on the best means by which
graduates of the public schools could be absorbed into the economic life of the country. The
results of the survey pertaining to education include:

1. Vocational education is relevant to the economic and social status of the people.
2. It was recommended that the work of the schools should not be to develop a peasantry class
but to train intelligent, civic-minded homemakers, skilled workers, and artisans.
3. Devote secondary education to agriculture, trades, industry, commerce, and home economics.

The Prosser Survey

In 1930, C. A. Prosser made a follow-up study on vocational education in the Philippines.
He observed various types of schools and schoolwork and interviewed school officials and
businessmen. In the survey he recommended improving various phases of vocational education
such as 7th grade shopwork, provincial trade schools, practical arts training in the regular high
schools, home economics, placement work, gardening, and agricultural education.

Other Government Commissioned Surveys

After the Prosser survey, several surveys were conducted after the 1930s to determine
mostly the quality of schools in the country. All of these surveys were government
commissioned, such as the Quezon Educational Survey in 1935 headed by Dr. Jorge C. Bocobo.
Another study was made in 1939 as a sequel to the Quezon Educational Survey; it made a
thorough study of existing educational methods, curricula, and facilities and recommended
changes on financing public education in the country. This was followed by another
congressional survey in 1948 by the Joint Congressional Committee on Education to look into
education following the independence of the Philippines from America. This study employed
several methodologies.

UNESCO Survey (1949)

The UNESCO undertook a survey on Philippine Education from March 30 to April 16,
1948 headed by Mary Trevelyan. The objective of the survey was to look at the educational
situation of the Philippines to guide planners of subsequent educational missions to the
Philippines. The report of the surveys was gathered from a conference with educators and
layman from private and public school all over the country. The following were the results:

1. There is a language problem and proposed a research program.


2. There is a need to for more effective elementary education.
3. Lengthening of the elementary-secondary program from 10 to 12 years.
4. Need to give attention to adult education.
5. Greater emphasis on community school
6. Conduct thorough surveys to serve as basis for long-range planning
7. Further strengthening of the teacher education program
8. Teachers income have not kept pace with the national income or cost of living
9. Delegation of administrative authority to provinces and chartered cities
10. Decrease of national expenditure on education
11. Advocated more financial support to schools from various sources

The UNESCO study was followed by further government studies. In 1951, the Senate
Special Committee on Educational Standards of Private Schools undertook to study private
schools. This study was headed by Antonio Isidro to investigate the standards of instruction in
private institutions of learning and to provide certificates of recognition in accordance with their
regulations. In 1967, another study was conducted by the Magsaysay Committee on General
Education, which was financed by the University of the East Alumni Association. In 1960, the National Economic
Council and the International Cooperation Administration surveyed public schools. The survey
was headed by Vitaliano Bernardino, Pedro Guiang, and J. Chester Swanson. Three
recommendations were provided to public schools: (1) To improve the quality of educational
services, (2) To expand the educational services, and (3) To provide better financing for the
schools.

The assessment conducted in the early years were mandated and/or commissioned by
government which was also initiated by the government. The private sectors were not yet
included in the studies as proponents and usually headed by foreign counterparts such as the
UNESCO and the Monroe, and Swanson survey. The focus of the assessments was on the overall
education of the country which is considered national research given the need of the government
to determine the status of the education in the country.

Lesson 2
Assessment in the Contemporary Period and Future Directions

EDCOM Report (1991)

The EDCOM report in 1991 indicated that dropout rates, especially in the rural areas,
were markedly high. The learning outcomes, as shown by achievement levels, reflect the
students' mastery of important competencies. There were high levels of simple literacy among both
15-24 year olds and 15+ year olds. “Repetition in Grade 1 was the highest among the six grades
of primary education reflects the inadequacy of preparation among the young children. All told,
the children with which the formal education system had to work with at the beginning of EFA
were generally handicapped by serious deficiencies in their personal constitution and in the skills
they needed to successfully go through the absorption of learning.”

Philippine Education Sector Study (PESS-1999)

The PESS was jointly conducted by the World Bank and Asian Development Bank. It
recommended:

1. A moratorium on the establishment of state colleges and universities.
2. The weaning of tertiary education institutions from public funding sources.
3. A more targeted program of college and university scholarships.

Aside from funding and conducting surveys that apply assessment methodologies and
processes, the government also practiced testing, starting with the screening of government
employees in 1924. Grade four to fourth year high school students were tested at the national
level in 1960 to 1961. Private organizations also spearheaded the enrichment of assessment
practices in the Philippines. Among these private institutions are the Center for Educational
Measurement (CEM) and the Asian Psychological Services and Assessment Corporation
(APSA).

Fund for Assistance to Private Education (FAPE)

FAPE started with testing programs such as the guidance and testing program in 1969.
It began with the College Entrance Test (CET), which was first administered in 1971 and again
in 1972. The consultants who worked on the project were Dr. Richard Pearson of the
Educational Testing Service (ETS), Dr. Angelina Ramirez, and Dr. Felipe. FAPE then worked
with the Department of Education, Culture, and Sports (DECS) to design the first National
College Entrance Examination (NCEE), which would serve to screen fourth year high school
students eligible to take a formal four-year course. There was a need to administer a national test
then because most universities and colleges did not have an entrance exam to screen students.
Later, the NCEE was completely endorsed by FAPE to the National Educational Testing Center
of the DECS.
The testing program of FAPE continued with the development of a package of four tests:
the Philippine Aptitude Classification Test (PACT), the Survey/Diagnostic Test (S/DT), the
College Scholarship Qualifying Test (CSQT), and the College Scholastic Aptitude Test (CSAT).
In 1978, FAPE institutionalized an independent agency called the Center for Educational
Measurement that would undertake testing and other measurement services.

Center for Educational Measurement

CEM started as an initiative of the Fund for Assistance to Private Education (FAPE).
CEM was headed by Dr. Leticia M. Asuzano, who was the executive vice-president. Since then,
several private schools have become members of CEM to continue its commitment and goals.
CEM has since developed up to 60 tests focused on education, such as the National Medical
Admissions Test (NMAT). The main advocacy of CEM is to improve the quality of formal
education through continuing advocacy and support for systematic research. CEM promotes the
role of educational testing and assessment in improving the quality of formal education at the
institutional and systems levels. Through test results, CEM helps improve the effectiveness of
teaching and student guidance.

Asian Psychological Services and Assessment Corporation

Aside from the CEM, in 1982 there was a growing demand for testing not only in the
educational setting but also in the industrial setting. Dr. Genevive Tan, who was a consultant to
various industries, felt the need to measure the Filipino 'psyche' in a valid way because most
industries used foreign tests. The Asian Psychological Services and Assessment Corporation was
created from this need. In 2001, headed by Dr. Leticia Asuzano, former EVP of CEM, APSA
extended its services to testing in the academic setting because of the growing demand of
private schools for quality tests.
The mission of APSA is a commitment to deliver excellent and focused assessment
technologies and competence-development programs to the academe and the industry to ensure
the highest standards of scholastic achievement and work performance and to ensure
stakeholders' satisfaction in accordance with company goals and objectives. APSA envisions
itself as the lead organization in assessment and a committed partner in the development of
quality programs, competencies, and skills for the academe and the industry.
APSA has numerous tests that measure mental ability, clerical aptitude, work habits, and
supervisory attitudes. For the academe, it has tests for basic education, the Assessment of
College Potential, and the Assessment of Nursing Potential. In the future, the first Assessment
for Engineering Potential and Assessment of Teachers Potential will be available for use in
higher education.
APSA pioneered the use of new mathematical approaches (the Rasch model of item
response theory) in developing tests, which goes beyond the norm-referenced approach. In 2002
it launched standards-based instruments in the Philippines that serve as benchmarks for local and
international schools. Standards-based assessment (1) provides objective and relevant
feedback to the school in terms of its quality and effectiveness of instruction measured against
national norms and international standards; (2) Identifies the areas of strengths and the
developmental areas of the institution's curriculum; (3) Pinpoints competencies of students and
learning gaps which serve as basis for learning reinforcement or remediation; (4) Provides good
feedback to the student on how well he has learned and his readiness to move to a higher
educational level.
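The Rasch model mentioned above expresses the probability of a correct response as a function of the difference between a person's ability and an item's difficulty, both on the logit scale. The sketch below illustrates this item response function with hypothetical values; it is not APSA's operational scoring procedure.

import numpy as np

# Rasch model: probability that a person with ability theta answers
# correctly an item with difficulty b (both on the logit scale)
def rasch_probability(theta, b):
    return 1.0 / (1.0 + np.exp(-(theta - b)))

abilities = np.array([-1.0, 0.0, 1.5])   # three hypothetical examinees
difficulty = 0.5                         # one hypothetical item
print(rasch_probability(abilities, difficulty).round(2))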

Building Future Leaders and Scientific Experts in Assessment and Evaluation in the Philippines

Only a few universities in the Philippines offer graduate training in measurement and
evaluation. The University of the Philippines offers a master's program in education specializing
in measurement and evaluation and a doctor of philosophy in research and evaluation. Likewise,
De La Salle University-Manila has a master of science in psychological measurement offered by
its psychology department, and its College of Education, a Center of Excellence, has a master of
arts in educational measurement and evaluation and a doctor of philosophy in educational
psychology major in research, measurement and evaluation.
With only two universities in the Philippines offering graduate training and specialization
in measurement and evaluation, some practitioners were trained in other countries such as the
United States and in Europe. There is a greater call for educators and those in the industry
involved in assessment to be trained in order to produce more experts in the field.

Professional Organization on Educational Assessment

Aside from the government and educational institutions, the Philippine Educational
Measurement and Evaluation Association (PEMEA) is a professional organization geared toward
promoting the culture of assessment in the country. The organization started with the National
Conference on Educational Measurement and Evaluation headed by Dr. Rose Marie Salazar-
Clemeña, who was the dean of the College of Education in De La Salle University-Manila,
together with the De La Salle-College of Saint Benilde's Center for Learning and Performance
Assessment. It was attended by participants from all around the Philippines. The theme of the
conference was "Developing a Culture of Assessment in Learning Organizations." The
conference aimed to provide a venue for assessment practitioners and professionals to discuss the
latest trends, practices, and technologies in educational measurement and evaluation in the
Philippines. In the said conference, PEMEA was formed. The purposes of the organization are as
follows:

1. To promote standards in various areas of education through appropriate and proper
assessment.
2. To provide technical assistance to educational institutions in the area of
instrumentation, assessment practices, benchmarking, and process of attaining standards.
3. To enhance and maintain the proper practice of measurement and evaluation in both
local and international level.
4. To enrich the theory, practice, and research in evaluation and measurement in the
Philippines.

The first batch of the board of directors elected for PEMEA were Dr. Richard DLC
Gonzales as president (University of Santo Tomas Graduate School), Neil O. Parinas as vice
president (De La Salle–College of Saint Benilde), Dr. Lina A. Miclat as secretary (De La Salle–
College of Saint Benilde), Marife M. Mamauag as treasurer (De La Salle–College of Saint
Benilde), and Belen M. Chu as PRO (Philippine Academy of Sakya). The board members are Dr.
Carlo Magno (De La Salle University-Manila), Dennis Alonzo (University of Southeastern
Philippines, Davao City), Paz H. Diaz (Miriam College), Ma. Lourdes M. Franco (Center for
Educational Measurement), Jimelo S. Tapay (De La Salle–College of Saint Benilde), and Evelyn
Y. Sillorequez (Western Visayas State University).
Aside from the universities and the professional organization that provide training on
measurement and evaluation, the field is growing in the Philippines because of the periodicals
that specialize in it. The CEM has its "Philippine Journal of Educational Measurement." The
APSA continues to publish its "APSA Journal of SBA Research." And the PEMEA will soon
launch the "Educational Measurement and Evaluation Review." Aside from these journals, there
are Filipino experts from different institutions who have published their work in international
journals and journals listed in the Social Science Index.

Activity

Write an essay describing the future direction of educational assessment in the Philippines.

About the Authors

Dr. Carlo Magno is presently a faculty member of the Counseling and Educational Psychology
Department at De La Salle University-Manila, where he teaches courses in measurement
and evaluation, educational research, psychometric theory, and statistics. He took his
undergraduate degree at De La Salle University-Manila, earning a Bachelor of Arts major
in Psychology. He took his Master's degree in Education major in Basic Education
Teaching at the Ateneo de Manila University. He received his PhD in Educational
Psychology major in Measurement and Evaluation at De La Salle University-Manila with
high distinction. He was trained in Structural Equation Modeling at Freie Universität in
Berlin, Germany. In 2005 he was awarded as the Most Outstanding Junior Faculty in
Psychology by the Samahan ng Mag-aaral sa Sikolohiya, and in 2007 he received the Best
Teacher Students' Choice Award from the College of Education in DLSU-Manila. In 2008,
he was awarded by the National Academy of Science and Technology for the Most
Outstanding Published Scientific Paper in the Social Sciences. The majority of his research
uses quantitative techniques in the field of educational psychology. Some of his work on
teacher performance, learner-centeredness, measurement and evaluation, self-regulation,
metacognition, and parenting were published in local and international refereed journals
and presented in local and international conferences. He is presently a board member of
the Philippine Educational Measurement and Evaluation Association.

Jerome A. Ouano is a faculty member of the Counseling and Educational Psychology
Department at De La Salle University-Manila, teaching courses in cognition and
learning, interpersonal behavior, educational psychology, facilitating learning, and
assessment of learning. He is a trainer in the implementation of the new Pre-service
Teacher Education Curriculum of the Philippines in the areas of Assessment of Student
Learning, Facilitating Learning, and Field Study, and helps empower the administrators
and teachers of many Teacher Education Institutions in Mindanao. Mr. Ouano has
presented empirical papers in local and international conferences. He has written books
for Field Study courses, and is active in sharing his expertise on assessment with in-
service teachers in basic education as well as tertiary faculty in many schools in
the country. He obtained his Bachelor’s degree in Psychology and Philosophy from Saint
Columban College, and his Master’s degree in Peace and Development Studies from
Xavier University. He is currently finishing his PhD in Educational Psychology major in
Learning and Human Development at De La Salle University-Manila.
