Evaluating the Performance and Forecasting Outcomes of the Universal Community Testing Programme for COVID-19 in Hong Kong Using Confusion Matrices

Catrina Ko
Published in The Startup, 8th September 2020

Dr Catrina Ko (Twitter: @dr_CatKo)
Hong Kong Global Connect (Twitter: @HKGlobalConnect)

6th September 2020

Abstract

Controversy over the efficacy and numerous other aspects of the mass screening programme for COVID-19 in Hong Kong has prompted interest in whether the scheme is succeeding in its aim of accurately identifying silent carriers in the population. This study retrospectively evaluated the test’s current performance by feeding data of the programme thus far into a confusion matrix and comparing them against the model outputs. Simple predictive analytics was also used to forecast the outputs that the testing procedure would continue to generate. Evaluation measures were then used to examine the models, which simulated how the test was expected to perform in real life, in both parts of the study. Results indicated that overwhelmingly large numbers of results would be generated for participants who are healthy, as opposed to those who are actual carriers, due to the low prevalence of the disease in Hong Kong. The study concluded that the low prevalence has knocked the class sizes of the confusion matrices out of balance and compromised the test’s ability to accurately classify and assign test outcomes to subjects. Given this verdict, it remains unclear whether the testing programme would help or harm the COVID-19 situation in Hong Kong from a mathematical standpoint.

Introduction

The Universal Community Testing Programme for COVID-19 is a free, voluntary mass screening scheme in Hong Kong. It is led by the HKSAR government in collaboration with the central government of the People’s Republic of China, supported by scientific expertise, technology and personnel from mainland laboratories and institutions such as the BGI, which also manufactures the test kits used in the programme. The programme began on 1st September 2020 and has attracted controversy both before and after its commencement. Scientific and medical experts have questioned the programme’s efficacy and value in quelling the outbreak (University of Hong Kong, 2020), particularly when the scheme is not accompanied by a mandatory post-test quarantine for individuals awaiting their results (Zhou, Pang, Zaharia & Fernandez, 2020), and whether the base rate fallacy brought about by the low prevalence means that the scheme would instead endanger the population further (Hamlett, 2020). Members of the public concerned with privacy (Liu & Woodhouse, 2020) and the cost-effectiveness of the scheme have also expressed their doubts. The authorities’ stance remains that mass testing is necessary in order to ‘break the chain of transmission’ and will effectively identify the silent carriers within the community who are spreading the virus but are asymptomatic themselves, while encouraging the entire population to take part (Hong Kong leader chides critics of universal coronavirus test, 2020) (Kwan, 2020).

As of 20:00 HKT on 6th September 2020, approximately 1,132,000 people have signed up for the scheme; and about 675,000 samples have been PCR-tested for SARS-CoV-2 by the mainland experts (預約普及檢測人數增至逾113萬, 2020). The programme has identified 15 positive cases of COVID-19 so far (港增21確診7冇源頭 外傭疑傳染同住九旬夫婦 另添3死, 2020), 12 of which were asymptomatic at the time of the test. At this midpoint of the duration of the programme, which is scheduled to finish on 11th September, it might be of value to statistically interrogate these data and evaluate its performance thus far, in light of the public debate. This study also aims to produce a mathematical prediction of its outcomes for the 1.1 million participants of the scheme using simple predictive analytics commonly adopted in medical settings, and observe whether these forecasts match with what the authorities envisioned the situation to be (林鄭月娥:4成個案源頭未明 社區隱形患者傳播力高, 2020) and what they expected of the scheme.

Method

The objective of the study was to contextualise the data of the mass testing programme at a point where the numbers, especially the number of hits, are large enough for meaningful analyses and evaluation. These data were obtained from daily news reports, with the latest reported figures before midnight on the day of writing (6th September 2020) being used as inputs for this study. The numbers taken were: the total number of sign-ups for the scheme, a; the number of samples tested in the laboratories, b; the number of positive cases yielded (the number of hits) therein, c; the total number of confirmed cases (both on and off the scheme combined), d; the number of deaths, e; and the number of recoveries, f.

The study was divided into two parts, with the first being a retrospective evaluation of the medical screening on the samples already tested, and the second being a forecast of test outcomes for all participants of the testing programme. Both parts involved the use of a confusion matrix as both an organiser of the numbers and a description of the proportion of predicted outcomes generated by the test based on the existing data.

In the first part, the ‘test probability’, (c/b)*100%, was first calculated for reference and for comparison with the existing data. The prevalence of COVID-19 in Hong Kong, ((d-(e+f))/7 500 000)*100%, was also determined. The prevalence was then used to compute the numbers of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN), as well as the positive predictive value (PPV) and the false omission rate (FOR), in a confusion matrix (Fig. 1). Finally, the Matthews correlation coefficient (MCC) was calculated using the results from the confusion matrix as an evaluation measure of the quality of the classifier prediction, which in the context of this study is the medical test itself. Several aspects of the test’s performance were then qualitatively evaluated from these metrics for the samples that had already been analysed in the screening programme.

In the second part, the prevalence was fed into this predictive part of the study, which models the outcomes of the screening for a. A confusion matrix (Fig. 2) was also drawn for computing the same set of test outcomes and metrics as in the first part. The F1 score was calculated as an additional measure of the test’s performance, as the test here would be applied to subjects whose samples are yet to be analysed. In this part of the study, the number of ‘silent carriers’ in the population, (TP+FN)*7 500 000/a, that the test should find based on the outcomes of this model was also calculated. The figure would then be compared with the authorities’ initial estimates of the number of silent carriers, which presumably worried them and prompted them to push this scheme out quickly without public consultation. A qualitative evaluation of the performance of the test, inferred from the calculated metrics, would also be discussed within the context of the current COVID-19 situation in Hong Kong.

Analysis

There are undoubtedly better statistical methods to scrutinise the performance and efficacy of this mass screening programme and the kit it uses to test for SARS-CoV-2, perhaps even tracking them over time given the constantly evolving data. However, this study was limited by the availability of finer information on the specification of the kit and any of its rivals for comparison. The MCC and the F1 score are very crude performance measures, but are what could be conveniently used with the information at hand. The confusion matrix is also the simplest and most commonly used technique for at least preliminarily evaluating a medical screening test, requiring easily obtained and derived information such as the number of samples being tested, the prevalence of the disease, and the sensitivity and specificity of the kit used for the test. Thus, although this study was not strictly a statistical interrogation, it did use statistical techniques and predictive analytics to achieve its objectives and lay out the mathematical anatomy of the data in an organised way.

Results

As of 20:00 HKT on 6th September 2020, approximately 675,000 samples, b, had been PCR-tested for SARS-CoV-2, out of which 15, c, were confirmed positive. Using these numbers, the ‘test probability’ to be used for comparison and evaluating the test’s performance to date was calculated as:

Test probability = (c/b)*100% = (15/675 000)*100% ≈ 0.0022%
(Eq. 1)

where c is the number of hits and b is the number of samples that had undergone PCR analysis.

The prevalence of COVID-19 in Hong Kong was also calculated to be fed into the models (Fig. 1 and 2):

Prevalence = (No. of active cases/Total population)*100% = ((d-(e+f))/7 500 000)*100% = (271/7 500 000)*100% ≈ 0.0036%
(Eq. 2)

where d is the total number of confirmed cases, e is the number of deaths, and f is the number of recoveries, all as of the end of 6th September 2020.
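Eqs. (1) and (2) can be reproduced with a few lines of Python. This is a minimal sketch using only the figures reported above; the individual values of d, e and f are not needed, since only the number of active cases, d-(e+f) = 271, enters Eq. (2):

```python
# Inputs as of 20:00 HKT, 6th September 2020 (from the text)
b = 675_000            # samples PCR-tested (b)
c = 15                 # positive cases yielded by the programme (c)
active_cases = 271     # d - (e + f), per Eq. (2)
population = 7_500_000

test_probability = c / b * 100                    # Eq. (1), in %
prevalence = active_cases / population * 100      # Eq. (2), in %

print(f"Test probability ≈ {test_probability:.4f}%")   # ≈ 0.0022%
print(f"Prevalence       ≈ {prevalence:.4f}%")         # ≈ 0.0036%
```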

The test outcomes, i.e. the numbers of TP, FP, FN, and TN, were then computed using Eq. (2), the number of samples analysed (b), and the specification of the kit as given by the BGI, 99% for both the sensitivity and the specificity, in the confusion matrix (Fig. 1). The PPV and the FOR were then derived from the test outcomes. The total number of actual positives, TP+FN, was 24.30. Figure 1 summarises the results for this part of the study:

Figure 1: Confusion matrix for evaluating test performance on the samples that had already undergone the PCR test
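Since the figure itself is not reproduced here, the expected cell counts behind Fig. 1 can be sketched in Python, assuming (as the text does) the rounded prevalence of 0.0036% and the BGI's stated 99% sensitivity and specificity; the function and variable names are illustrative, not taken from any published code:

```python
def confusion_matrix(n, prevalence, sensitivity, specificity):
    """Expected TP, FP, FN, TN counts for n subjects."""
    actual_pos = n * prevalence
    actual_neg = n - actual_pos
    tp = actual_pos * sensitivity      # carriers correctly flagged
    fn = actual_pos - tp               # carriers missed
    tn = actual_neg * specificity      # healthy subjects correctly cleared
    fp = actual_neg - tn               # healthy subjects falsely flagged
    return tp, fp, fn, tn

# Rounded prevalence from Eq. (2); 99%/99% per the BGI's stated spec
tp, fp, fn, tn = confusion_matrix(675_000, 0.0036 / 100, 0.99, 0.99)

ppv = tp / (tp + fp)     # positive predictive value
for_ = fn / (fn + tn)    # false omission rate

print(f"TP+FN ≈ {tp + fn:.2f}")   # total actual positives, ≈ 24.30
```

The PPV comes out at roughly 0.004 and the FOR at under one in a million, matching the pattern described in the Discussion.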

The quality of the test’s classification was given by the MCC, calculated using the formula below (Matthews, 1975) and the outcomes from Fig. 1:

MCC = ((TP*TN)-(FP*FN))/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)) ≈ 0.059
(Eq. 3)
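Eq. (3) can be checked numerically; the sketch below recomputes the Fig. 1 cell counts from the rounded prevalence and evaluates the MCC:

```python
from math import sqrt

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient (Matthews, 1975)."""
    num = tp * tn - fp * fn
    den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den

# Expected cell counts for Fig. 1: n = 675,000, prevalence 0.0036%,
# sensitivity = specificity = 0.99
n, prev = 675_000, 0.0036 / 100
pos = n * prev
tp, fn = pos * 0.99, pos * 0.01
tn, fp = (n - pos) * 0.99, (n - pos) * 0.01

print(f"MCC ≈ {mcc(tp, fp, fn, tn):.3f}")   # ≈ 0.059
```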

Taking the result from Eq. (1), the total number of sign-ups for the mass screening programme as of 20:00 HKT on 6th September 2020, a, which is 1,132,000, and the specification of the kit as provided by the BGI, the predicted test outcomes for a were computed in the confusion matrix (Fig. 2), from which the PPV and the FOR were derived. The sum of TP and FN, and thus the total predicted number of ‘silent carriers’ that the testing procedure would discover by the time all a had been tested for COVID-19, was 40.75. These are summarised in Fig. 2:

Figure 2: Confusion matrix for predicting the outcomes for all participants of the mass testing programme

The F1 score for the model in this part of the study was calculated to be:

F1 score = 2((Precision*Sensitivity)/(Precision+Sensitivity)) = 2((0.0036*0.99)/(0.0036+0.99)) ≈ 0.0072
(Eq. 4)

Taking the outcomes from Fig. 2, the MCC for this part of the study was:

MCC = ((TP*TN)-(FP*FN))/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)) ≈ 0.059
(Eq. 5)

Given the outcomes generated by the confusion matrix (Fig.2), this model predicted the number of ‘silent carriers’ in the population, i.e. the model-predicted prevalence of COVID-19, to be:

Predicted number of silent carriers in the population
= (TP+FN)*7 500 000/1 132 000 ≈ 269.99
(Eq. 6)
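Eqs. (4) to (6) for the forecasting model can likewise be reproduced. Note that using the unrounded PPV as the precision gives an F1 marginally below the 0.0072 of Eq. (4), which was computed with the PPV rounded to 0.0036, and the silent-carrier count reproduces Eq. (6) up to the article's own rounding (269.99 vs 270.00):

```python
# Forecast for all a = 1,132,000 sign-ups (Fig. 2), using the rounded
# prevalence from Eq. (2) and the BGI's stated 99%/99% kit specification
a = 1_132_000
prev = 0.0036 / 100
sens = spec = 0.99

actual_pos = a * prev                  # TP + FN
tp = actual_pos * sens
fn = actual_pos - tp
tn = (a - actual_pos) * spec
fp = (a - actual_pos) - tn

precision = tp / (tp + fp)             # PPV, unrounded
f1 = 2 * precision * sens / (precision + sens)   # Eq. (4)
silent = actual_pos * 7_500_000 / a    # Eq. (6), extrapolated to population

print(f"TP+FN ≈ {actual_pos:.2f}")                     # ≈ 40.75
print(f"F1 ≈ {f1:.4f}")
print(f"Predicted silent carriers ≈ {silent:.2f}")
```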

Discussion

The purpose of this study was to evaluate the performance of the Universal Community Testing Programme for COVID-19 on the data already out and available of the scheme. Predictions were also made for the test subjects who are yet to have their samples analysed or receive an outcome. These were achieved by simple mathematical computations. The models, driven purely by algorithmic rules and pre-existing data, thus produced highly objective numerical results and evaluation measures without moderation by any inbuilt qualitative terms or ecological (real-world) variables.

The first part of the study examined the data of the cases that had already been PCR-tested by the end of the 6th day of the programme in a mathematical setting, by placing them into a confusion matrix (Fig. 1). The number of actual positives, TP+FN, did not match the reported number of positive cases yielded by the programme, which fell short of the prediction by about 9 as of the end of 6th September 2020. A probable explanation could be that the sensitivity of the test kit is not as high as 99% in reality, or that external factors during sample collection or laboratory analysis affected the results. The number of FP did not materialise, presumably because the authorities performed further tests on the positive outcomes from the first laboratory run to confirm positivity before publishing the results (普及檢測驗出六宗確診個案, 2020). This study presumes that this practice will continue throughout the duration of the programme.

The second part of the study was an attempt to forecast the test outcomes for all participants (i.e. including those whose samples are yet to be analysed), also using a confusion matrix (Fig. 2). It is worth noting that the evaluation measures, e.g. the MCC, the PPV, and the FOR, were very similar between the two sets of results. This is due to the fact that the same prevalence (Eq. (2)) was used in both computations but that only the number of samples had varied. The model predicted 40.75 (the sum of TP and FN from Fig. 2) actual positives, asymptomatic at the time of sample collection, to be identified when all of these subjects (a, as of the end of 6th September 2020) have had their samples analysed. This forecast can be evaluated when the corresponding real-life data become available in the coming days.

The ability of the testing procedure to identify true conditions was measured as the MCC (Eq. (3) and (5)), which is essentially a correlation coefficient measuring the strength and direction of the association between actual conditions (TP+FN and FP+TN) and predicted outcomes (TP+FP and TN+FN), for both parts of the study. The MCCs for the models were both close to zero, suggesting that the test assigned (Fig. 1) and would assign (Fig. 2) test results to subjects near-randomly. This, however, must be interpreted with caution: although the MCC copes well with scenarios in which the class sizes of the confusion matrix are very unbalanced (Boughorbel, Jarray & El-Anbari, 2017), there are scenarios where one class is so small (such as the near-zero FN in both Figures) that it short-circuits the measurement. The difficulty of measuring the performance of the models meaningfully with the MCC ultimately stems from the fact that the total number of actual positives (TP and FN) was disproportionately small compared to the overwhelmingly large total of actual negatives (TN and FP). This was not unexpected given the low prevalence (Eq. (2)).

The low prevalence (Eq. (2)) had given rise to the phenomenon that the models seemed to perform better at classifying negative test outcomes than they would positive. The probability of infection is simply very low, and thus a vast majority of cases would be actual negatives; there is simply an overwhelmingly high chance that the test would identify true negatives (TN) successfully. Its contrast with the very small number of false negatives (FN) was reflected in the negligibly low FOR (Fig. 1 and 2), which is preferable. However, a contrast of this magnitude also existed between the number of actual positives and that of actual negatives overall, once again due to the low prevalence (Eq. (2)), and affected the positive predictive power of the test in the other direction.

The low prevalence (Eq. (2)) impacted the PPV (Fig. 1 and 2) negatively by inflating the expected proportion of false positives (FP) among all positive test outcomes. The PPV is very low for both models (Fig. 1 and 2) from both parts of the study, indicating that many of the positive outcomes from this testing procedure would be false positives (FP), despite the supposedly high specificity (99%) of the kit. Expressed as a rate or a percentage, the number of false results represents the likelihood of a group of any size being given false results by this test procedure, provided the rate stays relatively constant through to the end of the programme. The numbers of FN and FP are thus performance measures by nature, showing that the test’s performance is intrinsically limited by the low prevalence (Eq. (2)).
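The dependence of the PPV on the prevalence, i.e. the base rate effect, can be made concrete with Bayes’ rule. The prevalence values below, other than Hong Kong’s roughly 0.0036%, are purely illustrative, and the 99%/99% kit specification is assumed to hold:

```python
def ppv(prevalence, sensitivity=0.99, specificity=0.99):
    """P(actually infected | positive result), via Bayes' rule."""
    true_pos = sensitivity * prevalence            # rate of true positives
    false_pos = (1 - specificity) * (1 - prevalence)  # rate of false positives
    return true_pos / (true_pos + false_pos)

# Hong Kong's ~0.0036% prevalence, then hypothetical higher prevalences
for p in (0.000036, 0.001, 0.01, 0.1):
    print(f"prevalence {p:9.4%} -> PPV {ppv(p):.3f}")
```

Even at a prevalence of 1%, nearly 280 times Hong Kong’s, only about half of the positive results would be true positives under this specification.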

The PPV is significant because it can be used to describe the performance of this screening test. The F1 score (Eq. (4)) measures whether a test tolerates more false positives (dictated by precision, i.e. the PPV) or false negatives (dictated by recall, i.e. the sensitivity). The F1 score was, in the case of this study and contrary to some suggestions (e.g. Chicco and Jurman, 2020), a more meaningful and informative measure of the test’s performance than the MCC. The low prevalence (Eq. (2)) resulted in a PPV that is close to zero. The F1 score (Eq. (4)) for the predictive model (Fig. 2) of the study was very low, meaning that the balance had tipped towards tolerating false positives; this in turn characterises this medical test. The implication of this observation is that significant extra effort and resources would need to be spent on retesting preliminary positive results to reassure the public that their positive results are true, and to tackle the problems foreshadowed by the base rate fallacy (Hamlett, 2020).

The aim of the mass screening programme was, according to the authorities, to discover the ‘silent carriers’ within the community accurately and sever transmission chains. The authorities proposed that there were 1,500 silent carriers of COVID-19 in the community (林鄭月娥:4成個案源頭未明 社區隱形患者傳播力高, 2020). Based on the predictive outcome from the second model (Fig. 2) of this study, as of the end of 6th September, the testing scheme would identify up to 269.99 silent cases (Eq. (6)) out of a population of 7.5 million, if a true whole-of-population screening were achieved. This is an extrapolation from the predicted results of the test for the 1.132 million who signed up for the scheme. The models and their predictions would be refined if they took into account the demographics of the participants of the programme versus those who chose not to participate, and if the causes of the mismatch between the expected yield of clinically confirmed positive cases returned by the retrospective model (Fig. 1) and the actual yield were known. For now, 269.99 may be taken as a mathematically generated reference for the number of silent carriers among us at the moment.

Conclusion

This study has evaluated the performance of the Universal Community Testing Programme for COVID-19 in Hong Kong using simple predictive analytics, and has established that the low prevalence of the disease is the main challenge to the efficacy of the test from a mathematical perspective.

It is not yet known whether the mass screening programme would help or harm Hong Kong’s COVID-19 situation at a point where the city is already exiting its third wave of the outbreak, with an Rt of <0.5 signifying that the outbreak has already come under control (普及社區檢測計劃展開 許樹昌梁卓偉接受檢測, 2020). The directional effect of the programme is difficult to predict, as it is unknown whether the number of infections would rise from close contacts between medics and participants of the scheme and whether they would carry pathogens out into the community. It is also unknown what impact it would have if carriers awaiting their results and medical personnel involved in the operation of the scheme still roamed freely within the community.

The programme targets asymptomatic and assumed healthy individuals while symptomatic patients are urged to seek medical help immediately. This means that the incidence rate for COVID-19 may or may not be monitored or calculated separately from the rolling outputs of the community test programme. It is not yet clear how the system categorises an individual who was healthy or asymptomatic but became infected or symptomatic between the sample collection and result notification. Such confusion of data could convolute good-natured mathematical work seeking to understand and contextualise the COVID-19 situation in Hong Kong.

This study has already picked up puzzling patterns within the data that are pending explanation. When comparing the ‘test probability’ (Eq. (1)) with the population prevalence (Eq. (2)), the calculated ‘test probability’, which represents the current number of hits yielded by the BGI’s test after further retests by the Department of Health, turned out to be lower than the already very low population prevalence. Further research is required to statistically compare, and determine the significance of, the difference between the model outputs when the ‘test probability’ is used as an input and when the prevalence is used. Such studies would be a real test of the performance expected of the screening programme against reality, and would provide extra information for the public to judge whether large-scale programmes such as this, as part of our long battle with COVID-19, are always worth their while.

A downloadable PDF of this report can be accessed at: https://drive.google.com/file/d/1g0kcZL5lnzehH4YUWPf76K-FlGtgqw7h/view

Bibliography

Apple Daily 蘋果日報. 2020. 【疫情焦點】港增21確診7冇源頭 外傭疑傳染同住九旬夫婦 另添3死(附個案搜尋器). [online] Available at: <https://hk.appledaily.com/local/20200906/CBVNTVMFMRFRFILSPPQFNBDXAQ/> [Accessed 6 September 2020].

Boughorbel, S., Jarray, F. and El-Anbari, M., 2017. Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLOS ONE, 12(6), p.e0177678.

Chicco, D. and Jurman, G., 2020. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21(1).

Hamlett, T., 2020. To Test Or Not To Test? That Is Not A Political Question | Hong Kong Free Press HKFP. [online] Hong Kong Free Press HKFP. Available at: <https://hongkongfp.com/2020/08/29/to-test-or-not-to-test-that-is-not-a-political-question/> [Accessed 6 September 2020].

Hong Kong’s Information Services Department. 2020. 普及檢測驗出六宗確診個案. [online] Available at: <https://www.news.gov.hk/chi/2020/09/20200903/20200903_172026_512.html> [Accessed 6 September 2020].

Hong Kong’s Information Services Department. 2020. 預約普及檢測人數增至逾113萬. [online] Available at: <https://www.news.gov.hk/chi/2020/09/20200906/20200906_213141_470.html#:~:text=%E6%99%AE%E5%8F%8A%E7%A4%BE%E5%8D%80%E6%AA%A2%E6%B8%AC%E8%A8%88%E5%8A%83%E9%80%B2%E8%A1%8C,2019%E5%86%A0%E7%8B%80%E7%97%85%E6%AF%92%E6%A0%B8%E9%85%B8%E6%AA%A2%E6%B8%AC%E3%80%82> [Accessed 6 September 2020].

Kwan, R., 2020. Covid-19: Hong Kong Cases Dip To Single Digits For First Time In 7 Weeks, As Mass Testing Registration Set To Begin | Hong Kong Free Press HKFP. [online] Hong Kong Free Press HKFP. Available at: <https://hongkongfp.com/2020/08/24/covid-19-hong-kong-cases-dip-to-single-digits-for-first-time-in-7-weeks-as-mass-testing-registration-set-to-begin/> [Accessed 6 September 2020].

Kyodo News+. 2020. Hong Kong Leader Chides Critics Of Universal Coronavirus Test. [online] Available at: <https://english.kyodonews.net/news/2020/08/0b0e7d7bc899-hong-kong-leader-chides-critics-of-universal-coronavirus-test.html> [Accessed 6 September 2020].

Liu, N. and Woodhouse, A., 2020. Hong Kong Covid-19 Mass Testing Sows Distrust Among Activists. [online] Financial Times. Available at: <https://www.ft.com/content/d9c6219c-4022-4f75-bc0c-73153ba6f4b5> [Accessed 6 September 2020].

Matthews, B., 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) — Protein Structure, 405(2), pp.442–451.

News.rthk.hk. 2020. 林鄭月娥:4成個案源頭未明 社區隱形患者傳播力高 — RTHK. [online] Available at: <https://news.rthk.hk/rthk/ch/component/k2/1542386-20200807.htm> [Accessed 6 September 2020].

Now 新聞. 2020. 普及社區檢測計劃展開 許樹昌梁卓偉接受檢測. [online] Available at: <https://news.now.com/home/local/player?newsId=403872> [Accessed 6 September 2020].

University of Hong Kong, 2020. Ho Pak-Leung: Universal Tests Are Like Wasting Bullets. [online] Available at: <https://fightcovid19.hku.hk/ho-pak-leung-universal-tests-are-like-wasting-bullets/> [Accessed 6 September 2020].

Zhou, J., Pang, J., Zaharia, M. and Fernandez, C., 2020. Hong Kong Health Workers, Activists Urge Boycott Of Mass Testing. [online] U.S. Available at: <https://www.reuters.com/article/us-health-coronavirus-joshua-wong/hong-kong-health-workers-activists-urge-boycott-of-mass-testing-idUSKBN25Q0E0> [Accessed 6 September 2020].
