Sunday 16 March 2014

Student Evaluations and Statistical Hypothesis Testing

I had been thinking about the impact of the increase in class strength in higher education on the quality of performance evaluations, and eventually on the award of grades to students in a course --- an important exercise in separating the wheat from the chaff --- which assumes greater significance when we stretch the bottom threshold just to fill the available seats in our academic programmes. At first glance, it appears that we have decreased the rigour of our evaluations (both in taught courses and in dissertations). Let us first talk about the taught courses; we shall take up the evaluation of dissertations later.

The 50% across-the-board increase in student intake in all academic institutions has necessitated accommodating a large number of candidates with questionable academic preparation in our post-graduate courses. Some of these students find it extremely difficult to cope with the (somewhat watered-down) rigours of the academic programme and perform rather poorly in some courses. However, it is surprising to find the same candidates performing at an "Above Average" level in other courses and at times even managing an "Excellent" grade for the dissertation. This, despite a common refrain during tea-time discussions about the declining aptitude and commitment of the graduate students. I find it rather difficult to reconcile the two assessments: the informal assessment is of very poor quality work, and yet the formal assessment on the grade sheet reflects an "Excellent" grade for the dissertation!

The award of a grade to a student is an exercise in deciding whether the student has understood the subject and, if so, to what extent. The performance of the student is tracked through a variety of assessments throughout the semester, leading to a final score for the award of a grade. This final score plays the role of "the test statistic" in the parlance of statistical hypothesis testing: a standard tool for decision making based on statistical inference. In its simplest form, the test involves formulating a "Null Hypothesis" (H0), which is taken as the working assumption until it can be established by way of some evidence that it is not true and should be abandoned in favour of the "Alternate Hypothesis", typically denoted by H1. The formulation of an appropriate null hypothesis is the most crucial part of the test, and it is stated in a form that makes it easy to test for its falsity. The process of student evaluation can then be stated as:

The null hypothesis, H0: the student knows the subject and deserves a passing grade, 

and

The alternate hypothesis, H1: the student does not know the subject and consequently does not deserve a passing grade.

So the process of evaluation begins with the assumption that the student knows the subject, unless it can be proved otherwise. The various assessments are then aimed at trying to falsify this assumption. This has an important bearing on the way the examinations are set --- it is important that the students are examined thoroughly on the subject, and the tests should be designed to falsify the null hypothesis. Since nothing in this world is perfect, hypothesis testing too has its fair share of flaws, and two types of errors are possible:
  1. Type I error: reject the null hypothesis H0 when it is true --- the false alarm, and 
  2. Type II error: do not reject the null hypothesis H0 when it is false --- the missed alarm. 
The probability of a Type I error is the "significance level" of the test, while the "power" of the test is the probability of not committing a Type II error. Obviously, eliminating both types of errors simultaneously is impossible. The chances of a Type I error can be reduced by expanding the "acceptable range" of the test statistic, but that increases the chances of a Type II error, where the null hypothesis might not be rejected even when it is not true. The significance level, i.e., the probability of a Type I error, is decided beforehand and is kept at the largest tolerable level (typically 0.1, 0.05, or 0.01); the "acceptable range" of the test statistic is then established accordingly (the larger the probability of a Type I error, the smaller the acceptable range for the test statistic, and hence the smaller the chance of a Type II error).

In the context of student evaluation, this translates into designing an examination that is consistent with the chosen probability of Type I error. An easy examination combined with a small probability of Type I error, say 0.01, makes it almost impossible to reject the null hypothesis that a student knows the subject, and thereby increases the probability of the Type II error of a student getting a passing grade even though (s)he may not have a passing understanding of the subject. A study of the patterns of grades awarded in recent times indicates that we might have been committing Type II errors (missed alarms) in several cases by awarding passing grades to undeserving candidates while trying to minimize Type I errors (false alarms). While the possibility of a Type I error might result in a temporary setback for a few candidates, it is good for the academic programme of the institute if it prevents the Type II error of awarding degrees to the undeserving, which could hurt the academic standing of the institute in the long run. If we do not take care to minimize Type II errors (even at the cost of slightly increased chances of Type I errors), then the doomsday may not be far off when industry ceases to recognize the premium of an IIT degree.
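To make the trade-off concrete, here is a minimal sketch (not from the original discussion) that models an examination as a set of independent questions. The numbers are illustrative assumptions: a 20-question paper, a student who knows the subject answering each question correctly with probability 0.75, and a student who does not with probability 0.40. Sweeping the pass mark then shows how shrinking the Type I error (failing a deserving student) inflates the Type II error (passing an undeserving one).

from scipy.stats import binom

# Illustrative assumptions (not taken from actual examination data):
# a 20-question examination; a student who knows the subject answers each
# question correctly with probability 0.75, one who does not with 0.40.
N_QUESTIONS = 20
P_KNOWS = 0.75       # per-question success rate if H0 is true
P_DOES_NOT = 0.40    # per-question success rate if H0 is false

for pass_mark in range(6, 16):
    # Type I error: score falls below the pass mark although the student knows the subject.
    alpha = binom.cdf(pass_mark - 1, N_QUESTIONS, P_KNOWS)
    # Type II error: score reaches the pass mark although the student does not know the subject.
    beta = binom.sf(pass_mark - 1, N_QUESTIONS, P_DOES_NOT)
    print(f"pass mark {pass_mark:2d}:  Type I = {alpha:.3f}   Type II = {beta:.3f}")

Lowering the pass mark drives the Type I error towards zero while the Type II error balloons; making the paper easier (pushing the 0.40 up towards 0.75) has the same effect, which is precisely the combination of an easy examination and a tiny tolerated Type I error described above.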

There has been a steady decline in the quality of M.Tech. dissertations, with very few leading to scholarly publications, which was the primary objective of increasing the duration of the dissertation from one semester to two. Year after year, we have a horde of students graduating with an "Excellent" grade for dissertations that do not lead to any scholarly publication. This calls for a little introspection --- probably there is a need to recalibrate our grading scales and be a little more objective about these evaluations. It appears that quite a few of these students with "Excellent" grades in their dissertations have a CGPA of no more than 7.0-7.5 after the two semesters of course work --- roughly mapping to "Average" performance. I agree that there could be some exceptional cases where the effort put into the dissertation outshines that in regular course work --- but such cases are rare, and for what it is worth, the CGPA at the end of two semesters of course work gives a fairly good indication of a student's potential at present. The "Excellent" grade in the dissertation leads to an artificial inflation of the CGPA at the end of the academic programme. It may be better to switch to a Satisfactory/Unsatisfactory grading of dissertations to retain some objectivity of the CGPA as a measure of students' academic performance.