Better assessment, brighter future

The four steps are:


Step 1: Test purpose
Defining the test objective 
Defining the test design

Step 2: Construction
Item creation

Step 3: Assembly
Item selection
Test assembly

Step 4: Reporting
Reference framework


What are the stepping stones for developing a test?

Building a test requires great care. The process can typically be divided into four steps. The details of each step depend on the purpose of the test and the conditions under which it is created.

You need to take all the steps in the test process into account when setting up any test software. Many kinds of decisions are based on the information that test results provide. These decisions affect individuals, groups and businesses, and can even have an impact at a national or European level. This is true of all tests, though the nature of the information provided and the types of decisions will vary. The test criteria depend on the type of decision to be made. These criteria may include technical features and functions, as well as reliability and validity.

The purpose and objective of a test affect the steps to be taken and the choices to be made during test building. More importantly, they affect the decisions that may be made on the basis of the test results.

Together, we will look at three important questions which underpin the construction process.
  • What impact will the prospective decision have on the different levels (individual, group, school, organisational)?
  • What is the purpose of the test?
  • What are the implications of the test results, and for whom?

Step 1: Test purpose

Defining the test objective
During this stage, we will consider two key questions with you:

  • What do we want to measure?
    When creating a test, we need to be mindful of what we want to measure. For example, do we want to measure reading skills or a person's attitude towards work? When defining competencies, core objectives and frequently used methods act as a guide. At this point, it helps to consider what you want the reports to say.
  • Why do we want to measure?
    Ideally, tests are created as a means of systematically tracking candidates' progress. Results can be used to compare individual candidates or groups of candidates with one another. They can also be compared with the results of people elsewhere, or against a set of objectives. In doing so, you gain information about the results of the training provided at individual, group and company levels. You can also define focus points, which can help to actively improve teaching or training.


Defining the test design
A test must satisfy the following requirements:
  1. Your wishes as the client
  2. The characteristics required of the candidates
  3. Fixed requirements, such as reliability and validity
The objectives of the test help to determine the type of items, their level of difficulty, and any variation in difficulty between items.
Here are some important points to bear in mind when designing tests:
  • Size
    If we want our measurements to be reliable, we need a minimum number of test items. Tests can be divided into various sections or tasks. This enables us to make allowances for the candidate's concentration span, whilst still ensuring that we collect sufficient information.
  • Content
    We use blueprints to provide guidance on how many items should be presented in each component or sub-component of the test.
  • Level
    When determining the test level, we consider the various target groups.
  • Item type
    The choices open to us here include open-ended or multiple-choice questions, paper-based or computer-based tests, with or without an answer sheet, with or without pictures, and individual or group administration.
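The link between test size and reliability described under "Size" can be sketched with the Spearman-Brown formula. This is a standard psychometric result, but it is not named in the text above, so treat its use here as an assumption made for illustration. The function names and the figures in the note below are hypothetical.

```python
import math


def predicted_reliability(current_reliability: float, length_factor: float) -> float:
    """Spearman-Brown: expected reliability when the test length is
    multiplied by length_factor (2.0 means twice as many comparable items)."""
    r = current_reliability
    return (length_factor * r) / (1 + (length_factor - 1) * r)


def items_needed(current_items: int, current_reliability: float,
                 target_reliability: float) -> int:
    """Smallest number of comparable items expected to reach the target."""
    r, t = current_reliability, target_reliability
    if not (0 < r < 1 and 0 < t < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    # Spearman-Brown solved for the length factor.
    factor = (t * (1 - r)) / (r * (1 - t))
    return math.ceil(current_items * factor)
```

With these hypothetical figures, a 20-item test with a reliability of 0.70 would need about 35 comparable items to reach 0.80, which follows from `items_needed(20, 0.70, 0.80)`. The formula assumes the added items are of the same quality as the existing ones.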

Step 2: Construction

Item creation
The items are created by a dedicated team under the guidance of a test expert. The team is made up of teachers/trainers and other experts with practical experience.
  • The team is given an assignment. This states which items need to be created and the question type to be used. For example, it could list the spelling categories to be covered in a spelling and grammar test.
  • The members of the team create individual items and submit these items to the team for analysis.
  • Members of this team then meet to discuss the items they have produced. They critically analyse the content, the use of language, the level and the context.


We organise a pre-test to establish whether each item measures what we want it to measure.

Step 3: Assembly

Item selection
The data from the pre-test are used for the final item selection.
  • Item difficulty index
    To determine the difficulty index of an item, we look at the percentage of candidates who answered the item correctly. We omit items that are too easy or too difficult. Such items are sometimes saved to be used in a different test.
  • Alternative responses
    In the case of multiple-choice items, we look at whether the majority of candidates managed to select the right answer. If not, the item will not feature in the final test.
  • Competence
    Which candidates have provided right or wrong answers to a particular item? Do the most competent candidates select the right answer to relatively simple questions? More competent candidates should have a greater chance of correctly answering items.
  • Response field
    Items that function well statistically, but which are deemed 'poor items' by experts, are not included in the final test.
We use all of this information when assembling a test. Individual answers to any item tell us little about actual competence. After all, luck can sometimes play a major role in testing. What matters is how candidates respond to a series of items. It is only by looking at this as a whole that you, the client, can obtain reliable information.
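The statistical checks above can be sketched as follows. The difficulty index follows the description in the text (proportion of correct answers). For the "Competence" check, the sketch uses an item-rest correlation, which is one common way to quantify it and is an assumption rather than the method stated here. The cut-off values `min_p`, `max_p` and `min_r` are hypothetical, and expert judgement on item quality is not modelled.

```python
from statistics import mean, pstdev


def difficulty_index(item_scores):
    """Proportion of candidates answering the item correctly (1 = right, 0 = wrong)."""
    return mean(item_scores)


def item_rest_correlation(item_scores, rest_scores):
    """Pearson correlation between an item and the total score on all other items."""
    sx, sy = pstdev(item_scores), pstdev(rest_scores)
    if sx == 0 or sy == 0:
        return 0.0  # no variation, so the item cannot separate candidates
    mx, my = mean(item_scores), mean(rest_scores)
    cov = mean([(x - mx) * (y - my) for x, y in zip(item_scores, rest_scores)])
    return cov / (sx * sy)


def analyse_items(responses, min_p=0.20, max_p=0.90, min_r=0.20):
    """responses: one row per candidate, one 0/1 score per item.
    The thresholds are illustrative assumptions, not fixed standards."""
    totals = [sum(row) for row in responses]
    report = []
    for i in range(len(responses[0])):
        item = [row[i] for row in responses]
        rest = [total - score for total, score in zip(totals, item)]
        p = difficulty_index(item)
        r = item_rest_correlation(item, rest)
        keep = min_p <= p <= max_p and r >= min_r
        report.append({"item": i, "p": p, "r": r, "keep": keep})
    return report
```

An item that every candidate answers correctly gets a difficulty index of 1.0 and is flagged for omission, in line with the rule about items that are too easy.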


Test assembly
A test is assembled from the selected items. This should be done carefully, in order to ensure:
  • A variety of content
    Utilising different elements from a particular field, in pre-agreed proportions.
  • A variety of levels
    Using an assortment of easy, average and difficult items means that a distinction can be made between more and less competent candidates.
  • Suitability of levels
    The average candidate should be able to answer the vast majority of the items correctly.
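A minimal sketch of assembling a test against pre-agreed proportions is shown below. The data layout, the difficulty bands and their cut-offs (`easy_from`, `difficult_below`) are hypothetical choices made for illustration. A real assembly process would also weigh expert judgement and the target group.

```python
def difficulty_band(p, easy_from=0.80, difficult_below=0.50):
    """Classify an item by its difficulty index p (proportion correct).
    The cut-offs are illustrative assumptions."""
    if p >= easy_from:
        return "easy"
    if p < difficult_below:
        return "difficult"
    return "average"


def assemble_test(item_bank, blueprint):
    """Select items per content area and difficulty band.

    item_bank: list of dicts with "id", "content" and "p".
    blueprint: {content_area: {band: number_of_items}}.
    """
    selected = []
    for content, quotas in blueprint.items():
        for band, needed in quotas.items():
            candidates = [item for item in item_bank
                          if item["content"] == content
                          and difficulty_band(item["p"]) == band]
            if len(candidates) < needed:
                raise ValueError(
                    f"Not enough {band} items for {content!r}: "
                    f"need {needed}, have {len(candidates)}")
            selected.extend(candidates[:needed])
    return selected
```

Raising an error when a quota cannot be filled makes gaps in the item bank visible early, so that extra items can be written or pre-tested before the test is finalised.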

Step 4: Reporting

Reference framework
The reference framework sets out the methods used to report the test scores. Viewed in isolation, a score has no meaning. A candidate's score is only significant when compared against a set standard, or with the scores obtained by other candidates.
When an individual's grade is based on a comparison with the performance of a significant group of candidates, a more accurate assessment of each individual's relative performance can be made. The comparison group is referred to as a norm group or reference population. The grading scale showing the performance of the norm group is known as a norm scale.
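Comparing a score with a norm group can be sketched with two common statistics, a percentile rank and a standard score. Both are assumptions about how a norm scale might be expressed, since the text does not prescribe a particular scale, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring below the given score,
    counting candidates with the same score as half."""
    below = sum(1 for s in norm_scores if s < score)
    equal = sum(1 for s in norm_scores if s == score)
    return 100 * (below + 0.5 * equal) / len(norm_scores)


def z_score(score, norm_scores):
    """Distance from the norm group mean in standard deviations.
    Uses the population standard deviation of the norm group."""
    spread = pstdev(norm_scores)
    if spread == 0:
        raise ValueError("norm group scores show no variation")
    return (score - mean(norm_scores)) / spread
```

A score equal to the norm group mean gives a standard score of 0. The larger and more representative the norm group, the more meaningful both statistics become.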


The final phase of test building involves producing a manual, a justification, and instructions for those involved in the test.
We recommend always writing a justification. This is particularly relevant for high-stakes tests, where it is essential to be able to explain the results of the test or decision.
The COTAN (Dutch Association of Psychologists' Committee on Tests and Testing) uses an assessment system to audit the quality of tests. This system describes the criteria applied when assessing the test material, instructions and justification.