THE NATIONAL EXAMINATION OF ENGLISH:
A VALIDITY-BASED ACCOUNT[*]
by Ismail Petrus[†]
Abstract: The introduction of national examinations in Indonesia has caused much controversy. Some people think that the use of centralized tests is in conflict with Law No. 20/2003 on the national education system. Others argue that the law requires national monitoring of students’ achievement of standard competencies and control of the quality of education through nationwide evaluation. Those who agree with the administration of national examinations still question the validity and reliability of the tests. Validity of a test has traditionally been defined as ‘the degree to which a test measures what it claims, or purports, to be measuring’ (Brown, 1996:231). The traditional perspectives on test validity include content validity, construct validity, and criterion-related validity. More recent perspectives on test validity, however, cover both the evidential and consequential bases of test interpretation and use. This paper reviews the senior-high-school national examination of English on the basis of the test validity typology proposed by Weir (2005), which comprises a priori validity, before the test event, and a posteriori validity, after the test event. Some suggestions are provided should the government continue to use the national examination as a tool for assessing students’ performance.
Key words: national examination, a priori validity, a posteriori validity
High-school centralized tests have been administered in Indonesia since 1980. They were called EBTANAS (Evaluasi Belajar Tahap Akhir Nasional = National Final Evaluation of Students’ Learning) from 1980 to 2001, then UAN (Ujian Akhir Nasional = National Final Examination) in 2002, and since 2005 they have been named UN (Ujian Nasional = National Examination). The national examinations have caused much controversy. Some people think that the introduction of national examinations is in conflict with Law No. 20/2003 on the national education system. Article 58 of the law states that teachers evaluate their students in terms of the learning process, progress, and remediation. However, others argue that Articles 35 and 57 of the law require, respectively, national monitoring of students’ achievement of standard competencies and control of the quality of education through nationwide evaluation (see Furqon, 2004). Some educational activists urge the government to consider the uneven quality of schools nationwide, including poorly skilled teachers and inadequate facilities in a number of regions, before pressing ahead with the nationwide examination system (The Jakarta Post, 26 June 2006). Those who agree with the government’s decision still question the validity and reliability of the national examinations.
Validity of a test has traditionally been defined as ‘the degree to which a test measures what it claims, or purports, to be measuring’ (Brown, 1996:231). For example, if a test is designed to measure the objectives of a specific course, the validity of the test could be defended by showing that the test is indeed measuring those objectives. The traditional perspectives of test validity include content validity, construct validity, and criterion-related validity (Hatch & Farhady, 1982; Crocker & Algina, 1986; Wallen & Fraenkel, 1991; and Hughes, 2005). Recently, expanded views of validity have begun to surface in the educational measurement literature. For example, Cronbach (1988) covers the functional, political, economic, and explanatory perspectives of test validity. Messick (1989) discusses the evidential and consequential bases of test interpretation and use. Weir (2005) proposes a priori validity before the test event and a posteriori validity after the test event.
This paper reviews the senior-high-school national examination of English on the basis of the test validity typology proposed by Weir (2005). Weir (2005:12) defines validity as ‘the extent to which a test can be shown to produce data, i.e., test scores, which are an accurate representation of a candidate’s level of language knowledge or skills’. In this definition, validity resides in the scores on a particular administration of a test rather than in the test per se. Validity is multifaceted and different types of evidence are needed to support any claims for the validity of scores on a test. These are not alternatives but complementary aspects of an evidential basis for test interpretation. There are two main categories of validity: a priori validity at the stage of test development before the test event and a posteriori validity after the test event.
A Priori Validity
A priori validity includes theory-based validity and context validity. Theory-based validity refers to the extent to which a test is constructed on the basis of prevailing theories of the language processing that underlies the various operations required in real-life language use. Context validity refers to the extent to which the choice of tasks in a test is representative of the larger universe of tasks of which the test is assumed to be a sample. This coverage relates to the linguistic and interlocutor demands made by the task(s) as well as the conditions under which the task is performed, arising from both the task itself and its administrative setting. According to Weir (2005:14), construct validity is a function of the interaction of these two aspects of validity. A test should therefore always be constructed from an explicit specification that addresses both the cognitive and linguistic abilities involved in activities in the language use domain of interest and the context in which these abilities are performed. There are two major threats to construct validity: construct under-representation and construct-irrelevant variance (Messick, 1989). Test developers need to ensure that the constructs elicited are precisely those intended and that they are not contaminated by irrelevant variables. If important constructs are under-represented in a test, this may have an adverse backwash effect on the teaching that precedes the test.
The Regulation of the Ministry of National Education No. 23/2006 specifies the standard competencies of senior-high-school students who learn English as follows: with respect to formal and informal interpersonal and transactional discourses in the form of recount, narrative, procedure, descriptive, news item, report, analytical and hortatory exposition, spoof, explanation, discussion, and review, in daily-life contexts, (a) the students understand such discourses in oral form, (b) the students express such discourses orally, (c) the students understand such discourses in written form, and (d) the students write such discourses. These competencies cover all four language skills (listening, speaking, reading, and writing).
The 2007/2008 national examination of English consists of 50 multiple-choice items: 15 listening comprehension items (understanding dialogues, giving responses, and understanding monologues) and 35 reading comprehension items (understanding written dialogues, advertisements, and reading passages). The examination thus assesses only two language skills (listening and reading). The Ministry of National Education assumes that speaking and writing skills will be assessed by school teachers themselves. However, because speaking and writing are not represented in the examination, teachers may simply not teach these skills, and students may not learn them. Shohamy (2005:107) states that centralized tests are capable of dictating to teachers what to teach and what test-takers will study; teachers will focus on teaching test language and emphasize the material that is to be included on the test. If for practical reasons the examination can assess only listening and reading skills, then the government should redefine the objectives of teaching English to high-school students. Basically, learning a language aims at developing ‘the four levels of literacy, namely performative, functional, informational and epistemic levels’ (Wells, 1987 in Alwasilah, 2006:109), which refer respectively to the ability to read and write, the ability to use the language in everyday communication, the ability to access knowledge, and the ability to transform knowledge. Alwasilah (2006) proposes that the four levels of literacy be taught in stages in accordance with the levels of education: the first level to elementary-school pupils, the second to junior-high-school students, the third to senior-high-school students, and the fourth to university students. Therefore, the objectives of teaching English to senior-high-school students can be limited to the ability to access knowledge in English.
In addition, there is a question about how the test is developed. According to SEAMEO Library (2001), test items are solicited at the district and provincial levels throughout the country. Teachers from selected schools are invited to join item-writing teams, and each team produces 50 to 75 items for one national examination. These items are then sent to Jakarta, where they are reviewed and selected by the National Examination Committee. This procedure tends to give the districts and provinces a sense of involvement. However, in terms of credibility and practicality, the national examinations should be developed by professional test developers. The construct validity of a test does not lie in a sense of involvement but in the representativeness and relevance of the samples of abilities or skills being measured.
A Posteriori Validity
A posteriori validity includes scoring validity, criterion-related validity, and consequential validity. Weir (2005) uses the term ‘scoring validity’ instead of ‘test reliability’; scoring validity is the superordinate term for all aspects of reliability, in line with the growing consensus that reliability is a valuable part of a test’s overall validity. Scoring validity concerns the extent to which test results are stable over time, consistent in terms of content sampling, and free from bias. Hatch & Farhady (1982:253) emphasize two factors that affect the validity of a test: test administration and scoring procedures. Instability of test scores resulting from poor test administration, in which there is an opportunity to ‘cheat’, undermines scoring validity. In test administration, proctors are an important factor: they should make sure that there is no cheating during the test. However, in some cases of national examination administration, proctors who are also teachers ‘help’ students by giving them the answer key. In addition, at the district or regional level there have been ‘success teams’ which ‘correct’ the students’ answer sheets (Koran Tempo, 4 February 2005, http://www.antikorupsi.org/mod.php?mod=publisher&op=viewarticle&artid=3764). Teachers and administrators often view the national examinations not only as testing the language performance and achievement levels of their own students but also as assessing their own performance. They may then change their behaviour (negatively) to maximize the test scores (by cheating, for instance), given the consequences of successful or unsuccessful performance on the examinations. With regard to the scoring of responses, in 2004 there was controversy over the use of score conversion tables which attempted to help slow students but were disadvantageous to bright students (Tokoh Indonesia, http://www.tokohindonesia.com/majalah/22/kilas-un.shtml). The test results derived from a poor scoring system, and the examinations were not managed by an authorized testing institution (Pustaka Mawar, 5 December 2007, http://pustakamawar.wordpress.com/2007/12/05/un-penjamin-mutu). Again, this is another form of ‘cheating’.
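To make the notion of scoring consistency concrete, the sketch below estimates internal-consistency reliability with the KR-20 formula for dichotomously scored (right/wrong) multiple-choice items, one common indicator of the score consistency grouped here under scoring validity. This is a minimal illustration only; the item-response matrix is invented and does not come from any real administration of the national examination.

def kr20(responses):
    """Kuder-Richardson 20 for a matrix of 0/1 item scores (rows = examinees)."""
    n = len(responses)                      # number of examinees
    k = len(responses[0])                   # number of items
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / (n - 1)
    pq_sum = 0.0
    for i in range(k):
        p = sum(person[i] for person in responses) / n   # proportion answering item i correctly
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / var_total)

# Hypothetical responses of five examinees to four multiple-choice items.
data = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(kr20(data), 2))   # about 0.90 for these invented data

In practice, a low estimate, or estimates that fluctuate across administrations, would signal the kind of scoring-validity problems described above.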
Criterion-related validity refers to the extent to which test scores correlate with a suitable external criterion of performance with established properties; it is the degree to which the test is shown to be related to the established criterion. By demonstrating this relationship, one can more confidently claim that the test is a valid measure of the same thing that is measured by the criterion test (Hatch & Farhady, 1982:251). There are two types of criterion-related validity: concurrent validity and predictive validity. Concurrent validity looks for ‘a criterion which we believe is also an indicator of the ability being tested’ (Bachman, 1990:248): test scores can be correlated with another measure of performance taken at about the same time, usually an older, longer, established test, with teachers’ rankings of students, or even with students’ self-assessments. Predictive validity is concerned with making predictions about students’ future performance on the basis of test results; it can be established by correlating language test performance with later job or academic performance.
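As a concrete illustration, concurrent validity is typically reported as a correlation coefficient between scores on the test under scrutiny and scores on the external criterion. The sketch below computes a Pearson correlation for a handful of invented score pairs; the figures are hypothetical and merely stand in for national examination English scores paired with scores on an established external test taken at about the same time.

from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    ss_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (ss_x * ss_y)

# Hypothetical scores: national examination English marks (0-10 scale) and
# an established external English test for the same six students.
exam_scores      = [5.2, 6.8, 7.4, 4.9, 8.1, 6.0]
criterion_scores = [420, 500, 530, 410, 560, 470]
print(round(pearson(exam_scores, criterion_scores), 2))   # a high value supports concurrent validity

A strong positive coefficient across an adequate sample would support the claim that the examination measures the same ability as the criterion; with only a few students, as here, the figure is purely illustrative.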
There has been no research on the criterion-related validity of the national examination of English. Abdul-Hamied (1993) conducted a national study of English language teaching in 358 senior high schools in 26 provinces and found that the results of the national examination of English were discouraging: 66.7% of the students had scores below 6.0. In 2006, the provinces with the lowest percentages of students passing the national examinations included North Maluku (72.57%), East Nusa Tenggara (75.37%), and South Kalimantan (77.37%) (The Jakarta Post, 26 June 2006). The cut-off score for passing the national examinations was 4.26 in 2007 and 5.25 in 2008. However, according to the Ministry of National Education (2007:141), the senior secondary national examination scores for English rose from 4.8-5.3 in 2004 to 6.9-8.0 in 2006.
Consequential validity refers to the appraisal of the potential and actual social consequences of test interpretation and use; the appropriateness, meaningfulness, and usefulness of score-based inferences depend as well on the social consequences of the testing (Messick, 1989:18). A test is likely to have a backwash effect. Backwash is defined as the effect of a test on teachers, learners, parents, administrators, textbook writers, instruction, classroom practice, educational practices and beliefs, and curricula; it may refer both to intended positive or beneficial effects and to unintended harmful or negative effects (Bachman & Palmer, 1996). The negative effects of the national examinations include, for instance, students committing suicide or vandalizing schools after the announcement of the national examination results for junior and senior high school students (The Jakarta Post, 25 June 2007, http://pendidikan.net/mod.php?mod=publisher&op=viewarticle&cid=35&artid=385&PHPSESSID=8321928b48b9c84095882195192b30e0), and teachers’ and administrators’ improper attempts to ‘help’ their students, as mentioned earlier. The government should, however, encourage beneficial effects by first improving the quality and administration of the national examinations in order to obtain reliable data on students’ performance, and then, on the basis of those data, taking appropriate measures to improve the quality of education in Indonesia.
Suggestions
Should the government continue to use the national examination as a tool for assessing students’ performance, there is a need to do the following:
(1) Let professional test developers develop the national examination of English so that there will be no question about the construct validity of the test. The test should not be a compilation of selected teacher-made items. It could be developed by an independent educational testing institution, or, in the case of the English test, the TEFLIN (Teachers of English as a Foreign Language in Indonesia) organization could be asked to develop it.
(2) Manage the administration of the national examination at schools properly so that there is no cheating by students, teachers, or administrators. Cheating affects the reliability of the test results. Proctors should be teachers from other schools, and the persons in charge of test administration at a school should likewise be principals from other schools. There should be no opportunity for students’ answer sheets to ‘stay for some time’ at schools or regional education offices. High-school test administrators can learn from the administration of university entrance tests.
(3) Conduct research on the criterion-related validity of the national examination in order to convince test users that the test is a valid measure. The TOEFL (Test of English as a Foreign Language) or IELTS (International English Language Testing System) could be used to establish the concurrent validity of the test.
(4) Monitor the backwash effects of the national examination on students, teachers, parents, and administrators. If the national examination is to be treated as a high-stakes test, the government should anticipate its detrimental effects. The government should also use the test results to improve the quality of education by upgrading teachers and providing appropriate school facilities.
In conclusion, the use of high-stakes tests, i.e., national examinations, requires the government to provide validity evidence of the instrumental value of the tests. It is the right of all test users to ask for evidence that the tests are doing the jobs they are supposed to do. This requires evidence of both a priori and a posteriori validity. The government can convince test users by developing the tests professionally, administering them properly, providing an appropriate scoring system, conducting adequate research on the correlation of students’ performance on the tests with trustworthy external measures, and responsibly dealing with the backwash effects of the tests on stakeholders.
References