dedicated web server by Leo Freyer
published:
Deutsche Ausgabe: Freyer, L. (2012). Robuste Rangfolgen.
ISBN 978-3-86386-953-3, amazon.de, Berlin.
intended for publication:
Shanghai ranking indicators – a spy game
DOI: https://doi.org/10.13140/RG.2.2.28461.60649 . Retrieved 07 December 2024.
DOI: https://doi.org/10.13140/RG.2.2.30257.29288 . Retrieved 08 December 2024.
Leo Freyer *), Derek Ian Clark
Abstract The evaluation data of the university rankings published by ShanghaiRanking Consultancy are available to the public. In a difference analysis, the annual changes in all indicators were determined and pooled for the top 100 institutions. The analysis revealed a number of irregularities in the original data, some of which are presented and discussed here. These findings led to some fundamental considerations about the functionality and usability of Shanghai rankings, which are also expressed here.
Keywords Researcher coverage • Ranking type distinction • Stabilisers • Discrepancies • Unknown patterns • Verification
JEL Classification C80, I23 – General Data Collection & Computer Programs, Higher Education
Mathematics Subject Classification 62H30 – Statistics, Classification and Discrimination
_____________
*) Leo Freyer
P.O. Box 132, CH - 4143 Dornach, Switzerland
e-mail: apo(AT(@))pczen.ch
ORCID registry: 0000-0002-8163-2848
https://orcid.org/0000-0002-8163-2848
_____________
Introduction
Rankings with several indicators are an attempt to map multidimensional relationships onto a sequence of natural numbers. Such highly projective procedures are sensitive to design flaws (Lilienfeld et al. 2000). If projection is used, imaging errors should be avoided as far as possible. Rankings are used to make many, possibly far-reaching judgements in the course of a machine sorting process. It is legitimate to form and express one's own opinion about these common tools of simplification.
University rankings are widely recognised in the education sector, and the special form of Shanghai rankings now looks back on a tradition of more than twenty years. Interested readers will most likely be familiar with rankings from numerous research articles on the subject. Instead of another introduction with many cross-references, the informative and comprehensively sourced review by Fernández-Cano et al. (2018) is highly recommended reading. Open access articles on the topic are labelled in References with their digital object identifier (DOI) or weblink.
Shanghai rankings were initially referred to exclusively as Arwu (Academic ranking of world universities). The newer term seems more appropriate as it is non-pleonastic and source-related. In addition, the more practical acronym continues to be used. In this article, both terms are used alternately and interchangeably.
As a pharmacist and natural scientist, I have no professional involvement with any ranking organisation. None of the authors has a conflict of interest on this subject. All opinions expressed here are solely to be attributed to me as the initial author. They only ever refer to the Western education sector, especially that of Switzerland.
We have endeavoured to use open, concise language in accordance with APA (2020). We just want all information to be understood correctly in context. It is in no way our intention to offend anyone personally. All we are looking for are new ideas through a sporting competition. This article in particular is not written for the sake of argument. As I initially supported the novel Arwu, this text now serves to clear my way for ongoing research in the spirit of Wolfram (2002; 2022). In relation to the article, Wolfram style means using primarily graphics for the presentation and as few formulae as possible. No special skills are required to understand the basic content.
ShanghaiRanking Consultancy makes a scientific claim, as stated on their website (ShanghaiRanking 2023a). Transparency, comprehensibility and accuracy are indispensable prerequisites for science as we understand it. ShanghaiRanking's issues with its claimed transparency and with accuracy are discussed below. The many subjective weights without giving reasons are an obstacle to comprehensibility. This and many other problems, which Fernández-Cano et al. (2018) have compiled, raise the first, serious question. Where does science come into play?
Material and methods
Data
The Shanghai rankings of the top 100 institutions consist mainly of world-famous universities as well as some technical colleges, e.g. the Massachusetts Institute of Technology. Some institutions specialising in the humanities and social sciences, such as the London School of Economics, can also be part of this list.
The Shanghai ranking data have been restructured for our purposes. Each such dataset consists of:
1. a 7-digit identifier built up from the year surveyed plus a consecutive index from 001 to 100,
2. the rank of the academic institution according to ShanghaiRanking,
3. the name of the institution,
4.-9. all six surveyed indicators – see below for further details – as well as
10. the total score as calculated by ShanghaiRanking.
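As an illustration, such a restructured record could be modelled as follows. This is a minimal Python sketch (the paper's own programs are written in the Wolfram Language, see the Software section); all field and function names are our own, not ShanghaiRanking's:

```python
from typing import NamedTuple

class RankingRecord(NamedTuple):
    """One restructured dataset entry (items 1-10 above)."""
    identifier: str   # 7-digit: survey year plus consecutive index 001-100
    rank: str         # rank according to ShanghaiRanking
    name: str         # name of the institution
    alumni: float     # indicators 4-9, each on the scale 0-100
    award: float
    hici: float
    ns: float
    pub: float
    pcp: float
    total: float      # total score as calculated by ShanghaiRanking

def make_identifier(year: int, index: int) -> str:
    """Build the 7-digit identifier from the survey year and index 001-100."""
    if not 1 <= index <= 100:
        raise ValueError("index must lie between 1 and 100")
    return f"{year}{index:03d}"

# Illustrative placeholder values only, not actual ranking data:
record = RankingRecord(make_identifier(2023, 1), "1", "Example University",
                       50.0, 40.0, 60.0, 55.0, 70.0, 45.0, 55.0)
```

With this layout, each year surveyed fills a matrix of 100 such records with 10 fields each.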
The following full definitions of the indicators are provided for clarification of the text if needed. This section also contains our comments on possible consequences of the definitions.
Alumni, weight 10% : "Number of the alumni of an institution winning Nobel Prizes and Fields Medals. Alumni are defined as those who obtain bachelor's, master's or doctoral degrees from the institution. Different weights are set according to the periods of obtaining degrees. The weight is 100% for alumni obtaining degrees after 2011, 90% for alumni obtaining degrees in 2001-2010, 80% for alumni obtaining degrees in 1991-2000, and so on, and finally 10% for alumni obtaining degrees in 1921-1930. If a person obtains more than one degree from an institution, the institution is considered once only" (ShanghaiRanking 2023b).
Award, weight 20% : "Number of the staff of an institution winning Nobel Prizes in Physics, Chemistry, Medicine and Economics and Fields Medal in Mathematics. Staff is defined as those who work at an institution at the time of winning the prize. Different weights are set according to the periods of winning the prizes. The weight is 100% for winners after 2021, 90% for winners in 2011-2020, 80% for winners in 2001-2010, 70% for winners in 1991-2000, and so on, and finally 10% for winners in 1931-1940. If a winner is affiliated with more than one institution, each institution is assigned the reciprocal of the number of institutions. For Nobel prizes, if a prize is shared by more than one person, weights are set for winners according to their proportion of the prize" (op. cit.).
As the awarding of Nobel prizes or Fields medals is a rare event, only very few new data entries are expected every year. The definitions suggest that the number of data entries is lower for Award than for Alumni. By definition, all internal weights are adjusted only once every decade.
HiCi, weight 20% : "Number of Highly Cited Researchers selected by Clarivate. The Highly Cited Researchers list issued in January 2023 was used for the calculation of HiCi indicator in Arwu 2023. Only the primary affiliations of Highly Cited Researchers are considered" (op. cit.).
Clarivate (2023) specified that "of the world's population of scientists and social scientists, Highly Cited Researchers™ are 1 in 1'000". So at a university with 1000 researchers, on average only one of them is decisive for the HiCi value of their institution.
N&S, weight 20% : "Number of research articles published in Nature and Science between 2018 and 2022. To distinguish the order of author affiliation, a weight of 100% is assigned for corresponding author affiliation, 50% for first author affiliation (second author affiliation if the first author affiliation is the same as corresponding author affiliation), 25% for the next author affiliation, and 10% for other author affiliations. When there are more than one corresponding author address, we consider the first corresponding author address as the corresponding author address and consider other corresponding author addresses as first author address, second author address etc. following the order of the author addresses. Only publications of 'Article' type are considered" (ShanghaiRanking 2023b).
Given the large number of journals in which researchers publish, the proportion of researchers covered by N&S is probably also in the per mille range. None of the previous indicators includes a majority of the scientists employed by an institution. In short, two thirds of the Arwu indicators leave most of the research at any university largely unrecorded. This selective perception is better suited to a research ranking than to a comprehensive university ranking. So far, Arwu is more of a pure top-level research ranking. The two remaining indicators can hardly make up for this imbalance.
PUB, weight 20% : "Number of papers indexed in Science Citation Index-Expanded and Social Science Citation Index in 2022. Only publications of 'Article' type are considered. When calculating the total number of papers of an institution, a special weight of two was introduced for papers indexed in Social Science Citation Index" (op. cit.).
This is the only indicator for which high researcher coverage can be assumed. Researcher coverage refers to the proportion of all researchers at an institution recorded by an indicator. Up-and-coming universities score particularly well on PUB. It is considered the 'door opener' among indicators.
PCP (Per Capita Power), weight 10% : "Weighted scores of the above five indicators divided by the number of full-time equivalent academic staff. If the number of academic staff for institutions of a country cannot be obtained, the average number of academic staff for world top 1000 universities will be used for all institutions in this country. For ARWU 2023, the numbers of full-time equivalent academic staff are obtained for institutions in USA, UK, China, France, Canada, Japan, Italy, Australia, Netherlands, Sweden, Switzerland, Belgium, South Korea, Czechia, New Zealand, Saudi Arabia, Spain, Austria, Norway, Poland, Israel etc." (op. cit.).
As a result, there are two different ways of calculating PCP. It is not possible to look up which algorithm was used, as Fernández-Cano et al. (2018) previously criticised. The denominator for the second type of calculation is an annual constant that has never been disclosed. With the alternative calculation, possible effects of missing data on ranks can initially be avoided. If an institution subsequently provides the requested information, this could lead to a sudden change of rank. PCP does not show the usual precision of definitions, which makes it difficult to apply correctly. The deliberate lack of information in this central area may invite interested parties to take a closer look, see the Results section. PCP is considered the 'black box' among indicators.
Software
The programs to be presented were developed in the Wolfram Language (Wolfram 2023) up to the current version 14.0. Matrices containing different types of data can be processed with it. Every year surveyed is managed as a matrix of 1000 data cells (100 institutions with 10 fields each).
The programs follow the paradigm of data-oriented programming according to Sharvit (2022). It primarily means that all data from external sources remain stored outside of the program. This architectural decision alone makes the fully modular programs even easier to maintain. The separate data storage files are less susceptible to unintentional changes and are write-protected. Treating the data as independent leads to clear data-related structures within the programs. This in turn has several advantages like improved adaptability, testability, traceability of all transformations at runtime as well as shorter development cycles.
Key figures. The indicator key figures were calculated for all data 2004-2023 pooled. They were recalculated annually to obtain a measurement for the ranges, and also per decade for longer-term developments. The multi-year figures of the modes were averaged from the annual determinations. The skewnesses of distributions were calculated according to Pearson as (mean - median) / standard deviation (Walz 2004). The frequency distributions of the indicators were summarised and displayed.
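For replication, the two less common key figures can be sketched in a few lines. The following Python fragment (the paper's own programs use the Wolfram Language) implements the Pearson skewness and the median deviation exactly as defined above; the sample values are illustrative, not ranking data:

```python
import statistics

def pearson_skewness(data):
    """Skewness according to Pearson: (mean - median) / standard deviation."""
    return (statistics.mean(data) - statistics.median(data)) / statistics.stdev(data)

def median_deviation(data):
    """Median absolute deviation from the median, a robust dispersion measure."""
    med = statistics.median(data)
    return statistics.median(abs(x - med) for x in data)

sample = [1, 2, 2, 3, 10]          # illustrative right-skewed values
skew = pearson_skewness(sample)    # positive value indicates right skew
mad = median_deviation(sample)
```

For the strongly right-skewed sample above, the outlier 10 inflates the standard deviation but barely affects the median deviation, which is the robustness property exploited in Table 1.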
Difference analysis. The original aim of this analysis was to find out where on the characteristic scale from 0 to 100 mainly positive differences occur, which would mean indicator improvements. The program analyses the distributions and the annual differences of the indicators across a total of 101 characteristic classes [0...100]. Each individual indicator or a group of indicators can be analysed for the years under review. The first step is to search for all duplicate entries of institutions in two consecutive years. From this set, the differences for each indicator are determined per institution. The differences are compiled in relation to the class values of the prior year and visualised as dot plots or histograms. A pre-selectable search limit allows individual data points to be subsequently assigned to their institutions. The search then continues for the next two years, and so on. In each round, position parameters, range and frequencies of the differences per indicator are recorded, and all distributions are checked for normality. Such tests give a rough indication of the randomness of a distribution. The computing system automatically searches for the most powerful normal distribution test for the data in question (Wolfram Research 2010).
To deal with zero differences: a zero difference that occurs because an indicator has no value in two consecutive years is not taken into account. On the other hand, a zero difference due to values of the same non-zero size is taken into account in the ongoing calculations.
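The matching step and the zero-difference rules can be sketched as follows. This is a hypothetical Python fragment (the original is implemented in the Wolfram Language); the dict layout and institution names are our own simplification:

```python
def year_differences(prev, curr, indicator):
    """Differences (later year minus prior year) for institutions that occur
    in both years; prev and curr map institution name -> indicator values."""
    diffs = {}
    for name in prev.keys() & curr.keys():       # duplicate entries only
        a, b = prev[name][indicator], curr[name][indicator]
        if a == 0 and b == 0:
            continue  # no value in either year: zero difference discarded
        diffs[name] = b - a   # equal non-zero values yield a counted zero
    return diffs

# Illustrative values only:
prev = {"U1": {"Alumni": 20.0}, "U2": {"Alumni": 0.0}, "U3": {"Alumni": 15.0}}
curr = {"U1": {"Alumni": 18.0}, "U2": {"Alumni": 0.0},
        "U3": {"Alumni": 15.0}, "U4": {"Alumni": 30.0}}
```

Here U2 is skipped (zero in both years), U4 is skipped (no duplicate entry), and U3 contributes a counted zero difference between equal non-zero values.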
Full-time positions. The program examines the development of full-time positions or equivalents at a top 100 university (i) as an input. The period under consideration is also entered. Full-time positions and annual changes are calculated as percentages of the starting year. A variable threshold captures major annual changes. All required indicator data are exported as a spreadsheet for alternative replication.
The equation for PCP was set up according to its definition (ShanghaiRanking 2023b) and solved for the unknown full-time equivalents (FTE).
FTE = (Alumni/2 + Award + HiCi + N&S + PUB) / PCP
The result was set in relation to the strongest competitor (α), in this case Caltech, in order to remove the influence of the annually changing standardisation.
FTEi(rel.) = FTEi / FTEα
Standardisation refers to the calculation of relative characteristics from absolute ones, also known as normalisation (Tofallis 2012). If a particular university is not on the list in a given year, the computing system issues a message and moves on to the next year. This tolerance for errors allows the program to work its way through incomplete entries.
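The two equations above can be combined into a short sketch. The following hypothetical Python rendering (variable names are ours) makes explicit that the recovered FTE values are only defined up to the annually changing standardisation factor, which the relative form removes:

```python
def fte(alumni, award, hici, ns, pub, pcp):
    """FTE recovered from the PCP definition: the weighted sum of the five
    indicators (Alumni at half weight) divided by PCP."""
    return (alumni / 2 + award + hici + ns + pub) / pcp

def fte_relative(scores_i, scores_alpha):
    """FTE of institution i relative to the strongest competitor alpha,
    removing the influence of the annually changing standardisation."""
    return fte(*scores_i) / fte(*scores_alpha)

# Illustrative indicator values (Alumni, Award, HiCi, N&S, PUB, PCP),
# not actual ranking data:
institution = (20.0, 30.0, 40.0, 30.0, 60.0, 25.0)
alpha       = (60.0, 80.0, 70.0, 80.0, 60.0, 100.0)
```

Because the weights 10% and 20% appear in the numerator only as the fixed ratio 1:2, the factor 0.2 cancels in the relative form.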
Results
Key figures
The indicator data are concerned with relative characteristics. They all cover the same scale range, from 0 to 100 to one decimal place. Thus all characteristics can have an equally strong influence on the overall score. The original absolute values were put into a ratio, in the case of Arwu this means relative to the strongest expression.
If we look at the frequency distributions of the six indicators, they can be divided into two groups, see Figures 1 and 2.
The first group consists of the extremely right-skewed indicators Alumni, Award and PCP. All their medians are in the bottom third of the scale and the upper half is almost empty. There are only a few universities that can achieve high scores here. Most values remain relatively constant over a two-year period. These indicators are hardly accessible from a competitor's perspective. We call them stabilisers (Figure 1).
Fig. 1 Stabilisers. The frequency distributions of the indicators No. 1, 2 and 6, i.e. Alumni, Award and PCP from 2004 to 2023 with their medians at 23.0, 27.1 and 29.3 respectively
The second group has less skewed distributions and consists of the indicators HiCi, N&S and PUB. In contrast to the first group, they measure performance by means of citations. We call them performance indicators (Figure 2).
Fig. 2 Performance indicators. The frequency distributions of the indicators No. 3, 4 and 5, i.e. HiCi, N&S and PUB from 2004 to 2023 with their medians at 32.4, 28.6 and 54.4 respectively
Table 1 serves to quantitatively substantiate this group distinction. The probability p(0) that an indicator contains zeros instead of positive values is given as a percentage. Zero values occur in Alumni, Award and, to a much lesser extent, in HiCi. They have become more frequent over time for each of these indicators; their coverage of institutions has therefore become sparser. The very frequent zeros in Alumni and Award would bias their key figures, and comparisons with the other indicators would be distorted, as can be seen in the modes. Consequently, zero positions were excluded from all calculations. Otherwise the mode for both indicators mentioned would be zero.
The standard deviation formula is susceptible to asymmetric inputs. For skewed or sparse distributions, the median deviation in Table 1 is more suitable. Wolfram Research (2007) states "MedianDeviation[data] gives the median absolute deviation from the median of the elements in data. MedianDeviation is a robust measure of dispersion, which means it is not very sensitive to outliers".
In order to use homogeneous data, the database was experimentally divided into subsequent decades. Comparing 2004-2013 with 2014-2023, most key figures remained almost the same. The figures change only slightly each year. In the longer term, the performance indicators HiCi, N&S and PUB are the more volatile ones. These indicators have all become less skewed over the decades. The near-zero skewness of PUB in Table 1 means that its distribution now most closely resembles a normal distribution.
Mode, median and arithmetic mean all increased for PUB while they decreased for Alumni. This corresponds to the signature of Chinese universities which have been increasingly represented in the top selection since 2016.
Table 1 Key figures of Shanghai ranking indicators. The small print numbers in brackets show the extremes from annual calculations

indicator | period | p(0) [%] | arithm. mean | median | mode | stand. dev. | var. coeff. | median dev. | std. dev. / median dev. | Pearson skewness | median / skewness
Alumni 1 | 2004 - 2023 | 12.2 | 28.0 (25.9 ...) | 23.0 (20.0 ...) | 17 (12 ...) | 17.5 (16.77 ...) | 0.65 (0.58 ...) | 8.25 (6.9 ...) | 2.12 | 0.29 (0.23 ...) | 79
Alumni | 2004 - 2013 | 9.3 | 29.0 | 24.2 | 18 | 17.7 | 0.65 | 8.6 | 2.06 | 0.27 | 90
Alumni | 2014 - 2023 | 15.1 | 26.8 | 21.8 | 15 | 17.2 | 0.65 | 8.0 | 2.15 | 0.29 | 75
Award 2 | 2004 - 2023 | 17.7 | 32.7 (31.3 ...) | 27.1 (24.4 ...) | 20 (16 ...) | 20.8 (19.57 ...) | 0.65 (0.62 ...) | 8.3 (7.55 ...) | 2.51 | 0.27 (0.24 ...) | 100
Award | 2004 - 2013 | 16.8 | 31.8 | 26.0 | 19 | 20.1 | 0.65 | 8.2 | 2.45 | 0.29 | 90
Award | 2014 - 2023 | 18.5 | 33.7 | 27.6 | 22 | 21.4 | 0.65 | 8.6 | 2.49 | 0.29 | 95
HiCi 3 | 2004 - 2023 | 1.15 | 35.2 (32.5 ...) | 32.4 (30.5 ...) | 26 (21 ...) | 15.0 (12.8 ...) | 0.4 (0.37 ...) | 8.7 (7.05 ...) | 1.72 | 0.19 (0.07 ...) | 171
HiCi | 2004 - 2013 | 1.1 | 36.2 | 32.4 | 24 | 16.2 | 0.45 | 9.6 | 1.69 | 0.23 | 141
HiCi | 2014 - 2023 | 1.2 | 34.2 | 32.1 | 29 | 13.7 | 0.4 | 7.5 | 1.83 | 0.15 | 214
N&S 4 | 2004 - 2023 | 0 | 32.6 (31.6 ...) | 28.6 (27.7 ...) | 22 (18 ...) | 15.0 (14.63 ...) | 0.45 (0.44 ...) | 8.1 (7.3 ...) | 1.85 | 0.27 (0.19 ...) | 106
N&S | 2004 - 2013 | 0 | 32.7 | 28.4 | 22 | 15.2 | 0.45 | 8.35 | 1.82 | 0.29 | 98
N&S | 2014 - 2023 | 0 | 32.4 | 28.8 | 22 | 14.9 | 0.45 | 7.9 | 1.89 | 0.24 | 120
PUB 5 | 2004 - 2023 | 0 | 55.3 (52.7 ...) | 54.4 (51.3 ...) | 51 (44 ...) | 12.9 (11.95 ...) | 0.25 (0.22 ...) | 8.2 (7.15 ...) | 1.57 | 0.07 (0.0 ...) | 777
PUB | 2004 - 2013 | 0 | 54.0 | 52.5 | 48 | 12.3 | 0.23 | 7.8 | 1.58 | 0.12 | 438
PUB | 2014 - 2023 | 0 | 56.7 | 56.0 | 54 | 13.3 | 0.25 | 8.0 | 1.66 | 0.05 | 1120
PCP 6 | 2004 - 2023 | 0 | 32.5 (29.5 ...) | 29.3 (25.9 ...) | 27 (22 ...) | 13.0 (12.31 ...) | 0.4 (0.38 ...) | 4.7 (3.7 ...) | 2.76 | 0.25 (0.15 ...) | 117
PCP | 2004 - 2013 | 0 | 31.4 | 28.3 | 25 | 12.8 | 0.42 | 4.7 | 2.72 | 0.24 | 118
PCP | 2014 - 2023 | 0 | 33.6 | 30.2 | 29 | 13.0 | 0.4 | 4.55 | 2.86 | 0.26 | 116
Total | 2004 - 2023 | 0 | 36.6 (35.9 ...) | 31.8 (30.6 ...) | 26 (24 ...) | 13.3 (12.53 ...) | 0.35 (0.34 ...) | 5.1 (4.45 ...) | 2.61 | 0.36 (0.3 ...) | 88
With PCP, the numerator, being a sum, varies less than the underlying indicators. The number of full-time equivalents usually changes only slightly within a year. A relatively small dispersion of PCP can therefore be expected. A one-off outlier is to be expected whenever the method of calculation changes because data on full-time equivalents have been received. The scattering of this indicator over the years should decrease as more data on full-time equivalents become available. To check whether the dispersion of PCP has really decreased over time, the median deviation was used. By this measure, the dispersion remained fairly constant over the first fifteen years, yet it was always well below that of the other indicators, see Table 1. The expected decrease only appeared during the second decade, as compared in column 9. Given the two different ways of calculating PCP, a bimodal frequency distribution would have been expected. In contrast, PCP (Figure 1) is impressive with the highest and steepest peak of all indicators.
All indicators are skewed to the right to varying degrees, which is reflected in different positive skewness values (Table 1). The position of the median says something about the accessibility of an indicator. If median and skewness are set in relation, the differences between the indicators increase (column 12). In this way, changes in accessibility over time are accentuated. The ratio has increased for all performance indicators over the decades; in the case of PUB it has more than doubled. As with Alumni and Award, the accessibility measure has hardly changed for PCP. This property is characteristic of stabilisers. The otherwise rather opaque PCP indicator thus reveals something about its nature.
In addition, both groups of indicators can be tentatively split by the ratio standard deviation divided by the median deviation (Table 1). This factor is above two for the stabilisers and below for the performance indicators.
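This split can be expressed as a one-line rule. The following Python sketch applies it to the 2004-2023 ratios reported in Table 1:

```python
def classify(ratio):
    """Tentative split: standard deviation / median deviation above two
    indicates a stabiliser, below two a performance indicator."""
    return "stabiliser" if ratio > 2 else "performance indicator"

# Ratios std. dev. / median dev. for 2004-2023, as reported in Table 1:
ratios = {"Alumni": 2.12, "Award": 2.51, "HiCi": 1.72,
          "N&S": 1.85, "PUB": 1.57, "PCP": 2.76}
groups = {name: classify(r) for name, r in ratios.items()}
```

Applied to Table 1, the rule reproduces the grouping of Figures 1 and 2: Alumni, Award and PCP come out as stabilisers, HiCi, N&S and PUB as performance indicators.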
Discrepancies
Data availability statement. If anyone wants to verify the following objections or conduct their own research, they can request access to our data package. All possible formats are supported.
General. This section summarises a number of conspicuous patterns and inconsistencies that suggest data manipulation at ShanghaiRanking. According to the Introduction, no attempt was made to provide irrefutable evidence of this. It would have required looking at the raw data from the universities concerned in the years in question. However, it is difficult to imagine that such an effort could dispel the numerous suspicions. The examination tools and instructions (Florian 2007; Docampo et al. 2022) would be available. For now, the judgment on accuracy and reliability of Shanghai rankings is left to the reader.
Key figures of differences. Over the period 2004-2023 institutions that occur in two consecutive years were extracted. The yield was 96 percent of all possible cases. The differences were defined as the values of the later year minus those of the previous year.
Table 2 provides an overview of the differences (Δ) per indicator. Columns 7-9 show the mean absolute frequencies of positive, zero and negative differences per comparison, rounded to the nearest whole number. Columns 10 and 11 contain the largest positive and negative differences, which include the outliers of all comparisons. Column 12 shows the percentage of rejected null hypotheses (H0) in tests for normal distribution with the usual probability of error (p = 0.05).
Table 2 Two-year differences (Δ) with statistical key figures. The data are based on all subsequent year comparisons 2004-2023. Sample size = 19. Columns 2-9 contain averaged figures

indicator | arithm. mean | median | stand. dev. | median dev. | std. dev. / median dev. | freq. Δ > 0 | freq. Δ = 0 | freq. Δ < 0 | maximum | minimum | H0 rejected [%]
Alumni | -0.26 | -0.44 | 1.56 | 0.22 | 7.13 | 11 | 14 | 60 | 17.8 | -28.9 | 100
Award | -0.03 | -0.18 | 1.68 | 0.14 | 12.3 | 8 | 41 | 31 | 15.0 | -29.4 | 100
HiCi | -0.32 | -0.43 | 2.58 | 1.46 | 1.77 | 39 | 11 | 45 | 17.9 | -15.0 | 57.9
N&S | -0.05 | -0.07 | 1.55 | 0.95 | 1.63 | 45 | 4 | 48 | 8.9 | -6.9 | 21.1
PUB | -0.21 | -0.24 | 1.51 | 0.78 | 1.94 | 46 | 4 | 46 | 15.3 | -14.3 | 73.7
PCP | 0.12 | 0.11 | 2.17 | 0.84 | 2.57 | 51 | 4 | 40 | 30.4 | -18.3 | 89.5
What is striking in Table 2 for both Alumni and Award are the very few zero differences. According to the definitions, a vast majority of zero differences and medians close to zero were expected for 90 percent of the comparisons. Negative position parameters would occur only every tenth time, due to the adjustment of the damping factors. The corresponding frequencies in Table 2 should therefore differ by a factor of about 9 (columns 8 and 9). Instead, both ratios are much smaller or even reversed, namely ≈ 0.23 (= 14 / 60) for Alumni and ≈ 1.32 (= 41 / 31) for Award.
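The gap between expectation and observation can be checked directly from Table 2. A small Python verification of the quoted ratios:

```python
# Mean frequencies of zero and negative differences from Table 2
# (columns 8 and 9).
freq = {"Alumni": (14, 60), "Award": (41, 31)}

# Under the definitions, damping occurs only once per decade, so zero
# differences should outnumber negative ones by a factor of about 9.
expected = 9
observed = {name: zeros / negatives for name, (zeros, negatives) in freq.items()}
```

Both observed ratios (≈ 0.23 and ≈ 1.32) fall far short of the expected factor of about 9.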
As shown in column 12, Alumni and Award were not expected to pass distribution fit tests (Wolfram Research 2010) for normality. Passing here means that the null hypothesis of indistinguishability (H0) was not rejected in a test at the usual significance level p = 0.05 . In contrast, all three performance indicators were expected to easily pass the test due to normally distributed differences. The best of them was N&S with 21.1 percent dropouts. Surprisingly, the most widespread key player PUB came in last with a failure rate of 73.7 percent.
Normal distributions are associated with position parameters close to zero and balanced proportions of positive and negative deviations. Instead, HiCi shows a negative overall trend for both the position parameters and the frequencies.
The opaqueness of the PCP indicator meant that it hardly allowed for any expectations. If one takes 89.5 percent of rejected null hypotheses (column 12) as a benchmark, PCP is closer to the stabilisers than to the performance indicators.
Illustrations of differences. The annual differences were visualised as relations to the previous year's figures. Surprisingly, this search also visually reported instances of suspected data manipulation. The new evidence seems remarkable in that it allows a glimpse into the machinery of ShanghaiRanking. Furthermore these findings appear to be incompatible with the claimed reproducibility by Docampo and Cram (2014). Some of the resulting graphics seem completely unfamiliar, e.g. Figure 6. We would like to share some examples of these with the reader.
Liu and Cheng wrote in 2007 "The distribution of data for each indicator is examined for any significant distorting effect; standard statistical techniques are used to adjust the indicator if necessary" (p. 178). The repetition of this statement on their website (ShanghaiRanking 2023b) does not make it any clearer. A change in distribution requires interventions at the data level. As far as we know, there are no such 'standard techniques' other than deleting outliers. In accordance with the GLP rules (OECD 2021), raw data may only be excluded but never made to fit.
Based on the new findings the tortuous statement by Liu and Cheng now looks quite interpretable in a time-related manner. ShanghaiRanking (2023a) most recently wrote "One of the factors for the significant influence of ARWU is that its methodology is scientifically sound, stable and transparent". The analogously claimed stability of the rankings, which has been doubted elsewhere (Freyer 2014), has now proved to be untenable.
Two types of deviations from expectations were found: two-year patterns and long-term systematic changes. A two-year pattern describes a noticeable deviation that affects two consecutive years. Systematic changes affect the majority of the reporting years and largely contradict the indicator definitions of Alumni and Award.
Alumni. For Alumni, positive differences arise only through the new entries of award winners. Negative differences should occur every ten years, when the indicator weights are adjusted. This is not what was found. Instead, negative differences occurred every year except in 2016 and 2018. Their yearly frequencies were usually in the double-digit range, well above 50 (e.g. Fig. 3, Table 2).
By definition, the attenuations should lie on a straight line with a 10 percent incline. The self-imposed target of a 10 percent reduction was never met by either Alumni or Award. The sloping lines in Figures 3, 4 and 5 serve as orientation aids for the damping process. Points below the 10 percent threshold should never occur and would need to be explained. Furthermore, in 2019 (Figure 3) there should have been no negative differences at all. This means that all points below the zero line are doubtful and unexplained.
Like Award, Alumni also has a problem with too few positive data points. This was the case at least in 2006, 2008, 2009 and 2012. Since virtually all of the laureates have bachelor's, master's and doctoral degrees, by definition each of them should contribute up to three data points. Such absences are presumably errors.
Fig. 3 Alumni in 2019. Attenuation steps of 2 percent are shown as auxiliary lines. All negative y-values are questionable for various reasons mentioned in the text
Award. According to the Award definition, negative differences should only occur every ten years. Instead, negative y-values were absent in just under 37 percent of cases, rather than in 90 percent as expected. An annual adjustment of the damping factors for Alumni and Award would result in smaller shocks than an adjustment only every ten years. The question arises as to why the ranking team does not simply adapt the weight definitions to their annual recalculations. The next, more serious question is: why have they not been doing this for 20 years?
There is also a lot wrong with the positive changes at Award. In 2011, which became effective in 2012, nine scientists received a Nobel prize that falls under the Award definition. There were no Fields medals at that time. Arwu subsequently reported a total of 50(!) new Nobel prize winners according to the number of positive differences. That looks like data fabrication. This finding is illustrated in Figure 4. So practically only errors are displayed there, but these are not exceptions.
Fig. 4 Award in 2012. The chart captures all the errors mentioned in the text. This applies to all negative and the majority of positive y-values
The reverse also occurs. In most cases, a Nobel prize is shared among several researchers in at least one of the disciplines relevant to Arwu: medicine, chemistry, physics and economics. Nevertheless, the rankings for 2006, 2008, 2015, 2016 and 2023 show only 4 or fewer positive data points from the previous year. For example, ten Arwu-relevant Nobel prize shares and four Fields medals were awarded in 2022. Nevertheless, the subsequent ranking provides only 4 entries, see Figure 5.
Fig. 5 Award in 2023. Only 4 of 14 prize winners emerged as positive data entries. A damping step of approx. 4 percent incline is evident. By definition, no such slowdown was planned for 2023
What are the reasons for the few inputs? The available prizes are not distributed exclusively among the universities analysed. Retired Nobel laureates may not come under Arwu's definition of Award. The average age of Nobel prize winners is over 60, and the trend is rising (Bjørk 2019). Many of them are therefore excluded from the outset by this already sparse indicator. If the award winners had still been employed instead, their universities would have received a point with a hundred years of after-effects. The fact that a coincidence can have such a huge impact shows the inadequacy of the Award definition.
The number of staff or full-time equivalents is part of the definition. This means that Award cannot be verified, contrary to the Berlin principles (No. 11) on data collection. Non-verifiable also means not easily recalculable, which strongly calls for revision.
In light of these many inconsistencies, it is concluded that Shanghai rankings are unstable according to their own definitions.
HiCi. For HiCi 2013 the ranking team adopted all the data from HiCi 2012 except for one outlier. In this way the differences disappeared. So the indicator was silenced. The resulting rank order thus appeared more stable over time. 2015 was noticeably quiet again. Around two thirds of the HiCi differences were then zero, suggesting stability. The extensive lack of differences is unexplained in both cases, since zero differences for institutions were otherwise the exception.
All differences were plotted in relation to the previous year's values. In this way line patterns appeared repeatedly, e.g. 2005, 2006, 2007 (Figure 6), 2008, 2009 and 2017, to mention the more obvious ones.
Compared to the human eye, normality tests are a weak tool for pattern recognition. To their credit, they are impartial and can be automated. With regard to Figure 6, the distribution fit test (Wolfram Research 2010) decided as follows: "The null hypothesis that the data is distributed according to the normal distribution is rejected at the 5 percent level based on the Cramér-von Mises test". Is the test possibly deceptive and is the unusual pattern actually the result of chance? Perhaps if it were an isolated finding, but as just mentioned, this is not the case.
Fig. 6 HiCi in 2007. Calculated differences alone mysteriously led to a hand-shaped pattern. Multicolouring for accentuation was not used here. Nevertheless negative y-values within the pattern have their own meaning, see text
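The statistic behind such a decision can be reconstructed with standard-library tools alone. The article relied on Wolfram's DistributionFitTest; the following Python sketch computes only the one-sample Cramér-von Mises statistic W² (without the p-value machinery of the full test), with the normal parameters estimated from the data:

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of the normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def cramer_von_mises_w2(data):
    """One-sample Cramér-von Mises statistic W^2 against a normal
    distribution whose parameters are estimated from the data.
    Large values indicate departure from normality; the p-value
    computation of a full test is omitted here."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / (n - 1))
    xs = sorted(data)
    return 1.0 / (12 * n) + sum(
        ((2 * i - 1) / (2 * n) - normal_cdf(x, mu, sigma)) ** 2
        for i, x in enumerate(xs, start=1)
    )
```

A larger W² means a worse fit to normality; the rejection threshold depends on sample size and on the fact that the parameters were estimated, which is what the full test supplies.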
Data manipulation once assumed, what could be the reason for the hand-shaped patterns in HiCi? Currently there seems to be a mystery, a man-made one of course. The patterns are mainly prominent for small characteristic values, where the differences appear disproportionately large. Could an attempt to avoid too many tied ranks look like this? The local tie frequencies should be proportional to the flatness of the respective characteristic curves. A pilot study with all data ties could show whether there are enough of them according to local expectations. A more playful approach to understanding the motivation behind these patterns would be to create them synthetically. System tests, see below, also provide patterns that can be moulded into the desired shape. So the mystery can probably be solved if you want to dig further.
Patterns with negative y-values as in Figure 6 look suspicious in that certain universities have been moved to worse positions. This would also affect a number of weaker competitors below the HiCi equilibrium position of 32.4 (Table 1).
After 2017 conspicuous patterns disappeared and since then tests for normal distribution have mostly been passed. However, other interventions behind the scenes are likely to continue, as nothing has officially changed in terms of methodology.
PCP. 'Per Capita Performance' is not transparent due to the unknown full-time equivalents (FTE) and further clouds Arwu's picture of reproducibility. Its opacity means that adjustments to fix ranks are undetectable here from the outset. Strikingly, there were very few years in which positive and negative differences were in equilibrium. In most cases the proportions were greatly one-sided, with changing signs. Figure 7 serves as an example.
Fig. 7 PCP in 2007. As with most plots, there are noticeably shifted proportions of PCP differences. Overlaps due to mapping conceal an even stronger one-sidedness of 17 : 81 in favour of negative differences
According to the Berlin principles (CEPES 2006) PCP would require an extensive redesign regarding data origin (No. 4), transparency (No. 6), verifiability (No. 11) and error correction (No. 16). These principles do not demand anything new, but merely refer to general due diligence when handling data (OECD 2021). The fact that no such rework has taken place despite longstanding reminders from colleagues is a good reason for scepticism.
The standardisation of PCP is particularly questionable in the main type of calculation, where an institution has provided its FTE data. Again, the figures are standardised against the strongest competitor, which is known to be a weak point (Docampo and Cram 2014). An additional complication here is that the standardisation figure comes from a foreign source, namely Caltech: it cannot be verified precisely by the ranking team, and its impact on all other data must simply be accepted. Using Chinese universities for standardisation would now be the method of choice.
Program verification
It is theoretically conceivable that all the patterns found are artefacts, i.e. that they were caused solely by errors in the program. Overlooked sources of error remain possible, of course. For several reasons, however, the difference program is unlikely to produce more than minor errors of detail.
On the one hand, the program remained simple because its high-level language handles intersections and complements natively. Finding the matching pairs of two years and calculating the changes was therefore no big deal.
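The original program was not written in Python, but the pairing step described here can be sketched with ordinary set operations; the institution names and scores below are hypothetical:

```python
def year_differences(prev, curr):
    """Pair the institutions present in both years (intersection) and
    compute current-minus-previous indicator differences; institutions
    appearing in only one year (the complements) are reported
    separately. Inputs map institution name -> indicator score."""
    common = prev.keys() & curr.keys()    # intersection: matching pairs
    dropped = prev.keys() - curr.keys()   # complement: left the list
    entered = curr.keys() - prev.keys()   # complement: new on the list
    diffs = {name: curr[name] - prev[name] for name in sorted(common)}
    return diffs, dropped, entered

# Hypothetical two-year snippet of one indicator:
hici_prev = {"Univ A": 40.0, "Univ B": 32.0, "Univ C": 25.0}
hici_curr = {"Univ A": 41.5, "Univ B": 32.0, "Univ D": 28.0}
diffs, dropped, entered = year_differences(hici_prev, hici_curr)
# diffs -> {"Univ A": 1.5, "Univ B": 0.0}; "Univ C" dropped, "Univ D" entered
```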
Furthermore, a common test battery ensured that the program worked properly, much as in an experiment. These system tests are referred to as blind test, calibration test and cross-check. Blind tests were realised by letting the program use the same reporting year for both inputs; a passed blind test returns 100 percent matching pairs, all with zero differences. Calibration tests were created by replacing existing input data in a controlled manner: the calibration data were saved separately and fed into the program instead of the standard inputs, and for the test to pass, the program must return exactly the differences and dot plots defined by the prepared data. The third type of system test, the cross-check, is the easiest to set up: the inputs are simply swapped, and all results should then be equal in amount but inverted. The program passed each of these tests flawlessly.
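A minimal version of this test battery, written against a stand-in difference routine and hypothetical data, might look like:

```python
def differ(prev, curr):
    """Minimal stand-in for the difference routine: current minus
    previous score for every institution present in both years."""
    return {k: curr[k] - prev[k] for k in prev.keys() & curr.keys()}

def blind_test(year):
    """Same reporting year fed in twice: expect 100 percent matching
    pairs, all with zero differences."""
    d = differ(year, year)
    return len(d) == len(year) and all(v == 0 for v in d.values())

def calibration_test(year, name, shift):
    """Replace one existing entry in a controlled manner and expect
    exactly the prepared difference back."""
    prepared = dict(year)
    prepared[name] = year[name] + shift
    return differ(year, prepared)[name] == shift

def cross_check(prev, curr):
    """Swap the inputs: every difference must come back equal in
    amount but inverted in sign."""
    fwd, rev = differ(prev, curr), differ(curr, prev)
    return all(rev[k] == -v for k, v in fwd.items())

y1 = {"Univ A": 40.0, "Univ B": 32.0}   # hypothetical year 1
y2 = {"Univ A": 41.5, "Univ B": 30.0}   # hypothetical year 2
```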
Another argument against artefacts arises from the different types of aberrations. An example of different types in Award are, firstly, inconsistent quantities of data entries and, secondly, too frequent damping lines, each with a specific inclination (e.g. Figures 4 and 5). For any given indicator, the full reproducibility of all these patterns over the years proceeded with the inevitability of a film sequence.
Once an overview of the data had been obtained, alternative replication was possible without programming. The original data for two years were copied directly into a spreadsheet. All differences were calculated and plotted there. The graphs in the spreadsheet did not differ from those in the program, e.g. Figures 3 to 7. The program and its results have been verified using the alternative solution method.
Full-time positions
The full-time equivalents (FTE) at two nearby institutions served as a plausibility check for PCP. ETH and the University of Zurich both show a complete data set in Arwu's top 100 rankings. ETH means the Swiss Federal Institute of Technology Zurich. Table 3 gives an overview of the results.
Table 3 Full-time equivalents (FTE) for two Swiss universities, 2004-2023, indexed to the base year 2004 (= 100 percent). All data were provided by ShanghaiRanking
| year | ETH FTE | ETH ΔFTE | UZH FTE | UZH ΔFTE |
|------|---------|----------|---------|----------|
| 2004 | 100 | 0 | 100 | 0 |
| 2005 | 76.3 | -23.7 | 96.4 | -3.6 |
| 2006 | 76.3 | 0 | 97.1 | 0.7 |
| 2007 | 76.5 | 0.2 | 96.9 | -0.2 |
| 2008 | 75.6 | -0.9 | 96.9 | 0 |
| 2009 | 75.9 | 0.3 | 97.3 | 0.4 |
| 2010 | 94.4 | 18.5 | 121.7 | 24.4 |
| 2011 | 95.9 | 1.5 | 125.2 | 3.5 |
| 2012 | 98.2 | 2.3 | 126.9 | 1.7 |
| 2013 | 104.1 | 5.9 | 127.4 | 0.5 |
| 2014 | 106.4 | 2.3 | 137.4 | 10.0 |
| 2015 | 106.5 | 0.1 | 137.1 | -0.3 |
| 2016 | 111.5 | 5.0 | 113.2 | -23.9 |
| 2017 | 110.0 | -1.5 | 110.3 | -2.9 |
| 2018 | 108.9 | -1.1 | 112.0 | 1.7 |
| 2019 | 110.0 | 1.1 | 110.8 | -1.2 |
| 2020 | 111.5 | 1.5 | 99.7 | -11.1 |
| 2021 | 110.4 | -1.1 | 101.8 | 2.1 |
| 2022 | 110.1 | -0.3 | 99.6 | -2.2 |
| 2023 | 112.1 | 2.0 | 107.0 | 7.4 |
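The kind of plausibility screening applied to Table 3 can be mechanised. A sketch over the ETH column, flagging year-over-year changes beyond ±10 index points (the threshold is a choice for illustration, not part of Arwu's method):

```python
def flag_jumps(fte_index, threshold=10.0):
    """Return (year, change) pairs where the year-over-year change
    in the indexed FTE figure exceeds the threshold in absolute
    terms. `fte_index` maps year -> indexed FTE (base year = 100)."""
    years = sorted(fte_index)
    return [(y, round(fte_index[y] - fte_index[p], 1))
            for p, y in zip(years, years[1:])
            if abs(fte_index[y] - fte_index[p]) > threshold]

# ETH column of Table 3, 2004-2010 (2004 = 100):
eth = {2004: 100.0, 2005: 76.3, 2006: 76.3, 2007: 76.5,
       2008: 75.6, 2009: 75.9, 2010: 94.4}
flag_jumps(eth)  # -> [(2005, -23.7), (2010, 18.5)]
```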
Arwu's information on ETH and the University of Zurich is hard to believe. Neither university has ever experienced a mass dismissal, certainly not repeatedly in the double-digit range (Table 3, column 5). Mass recruitment on the same scale is likewise implied for both institutions. This is practically impossible, as there were never enough qualified people available.
From a local perspective it is untenable that by 2022 the University of Zurich had barely regained its FTE level of 2004 (column 4). In fact, both universities have seen steady growth by any measure over the decades. They are in close proximity to each other and could not have pursued personnel policies that diverged to this extent.
There are a number of top universities within the same administrative area, e.g. in Munich, Paris and London. Perhaps correlations between full-time equivalents over time could be demonstrated there. At present, Arwu's job statistics have no obvious connection to reality, at least not by Swiss standards.
Elsewhere in the top selection, a change in the number of full-time positions of more than ±10 percent in a single year is the rule, including for the leader Harvard in 2005. Annual changes in full-time equivalents of more than 40 percent occur several times. Four double-digit swings per institution, as shown in column 5, are not exceptional.
With such suddenly fluctuating personnel data, PCP is probably not performance-related, as the label 'Per Capita Performance' would have you believe. Here we see the third, obviously necessary stabiliser. Again, the indicator groups can be separated by the standard deviation divided by the median deviation (Table 2, column 6), now applied to the differences. This confirms PCP's classification as a genuine stabiliser.
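The separation measure just mentioned is easy to state. A standard-library sketch follows; the actual values and cut-off used in Table 2 are not reproduced here, so any threshold applied to this ratio is an assumption:

```python
import statistics

def sd_over_mad(diffs):
    """Sample standard deviation divided by the median absolute
    deviation from the median, computed over one indicator's
    year-over-year differences. The text uses this ratio to separate
    the indicator groups; the cut-off itself is not given here."""
    med = statistics.median(diffs)
    mad = statistics.median([abs(d - med) for d in diffs])
    if mad == 0:
        raise ValueError("median deviation is zero; ratio undefined")
    return statistics.stdev(diffs) / mad
```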
Last but not least, the above mislabelling is a common feature of Arwu's stabilisers. Describing Alumni as current 'Quality of Education' (ShanghaiRanking 2023b) and Award as 'Quality of Faculty' (op. cit.) are euphemisms at best.
In short, half of the indicators in Arwu are not there to 'indicate' anything. They are purely auxiliary supports.
Synopsis
The actual reason for the decades-long after-effect of Alumni and Award is their stabilising effect. Despite this built-in inertia, continued hidden intervention is necessary as suggested by the many findings such as inexplicable patterns and omissions of data. Summarising our analysis, an overwhelming majority of years and all 4 indicators considered have been compromised. The irregularities in the data all have the effect of preventing or reducing rank changes in order to conceal the inherent instability. The deeper the investigation, the more weaknesses came to light. As these findings did not diminish, the search was deliberately ended at the third stabiliser. For example, exotic patterns for full-time equivalents were thus excluded. The high rejection rate of the ubiquitous PUB was not researched at all. The irregularities would probably have continued there.
Arwu is largely a ranking of top-level research by design. Chinese universities were not represented in the top 100 list before 2016, which is why the actual findings mainly affect everyone else. In Shanghai rankings, the established Western universities merely form the higher education landscape. Deriving their exact positions from this is an off-label use that the existing methodology cannot cover. Those responsible at ShanghaiRanking Consultancy have failed to warn in any way about such restrictions. Despite their general scientific claim, there is no recognisable attempt to reduce or even quantify the numerous subjective influences.
Discussion
Leiden Rankings
Are there any university rankings without subjective weightings? The Leiden rankings are an example. They have existed since 2006 (Waltman et al. 2012; see also CWTS 2023), making them slightly more recent than the Shanghai rankings, and they also adopt the Berlin Principles of 2006 for good ranking practice. Leiden rankings seem quite adaptable to new developments in the higher education environment. The natural fluctuation in the middle and lower areas of the hierarchy appears acceptable. Different indicators are not added together; instead, there are different perspectives on universities and corresponding representations. Most of the Leiden indicators are ratios between the number of favourable events and the total number of events. For example, the so-called 'crown' indicator measures the ratio of a department's publications that belong to the 10 percent most cited publications of a field to its total number of publications. As probabilities, these numbers are independent of size and reflect quality better than absolute numbers. Qualified decision-makers in the Western education sector may consult Leiden rankings and thus need not overemphasise absolute numbers for ranks.
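Such a ratio indicator is straightforward to compute; a sketch with hypothetical publication counts:

```python
def pp_top10(top10_pubs, total_pubs):
    """Leiden-style 'crown' ratio: the share of a unit's publications
    that belong to the 10 percent most cited in their field.
    Size-independent; 0.10 is the field-average expectation.
    The counts used below are hypothetical."""
    if total_pubs == 0:
        raise ValueError("no publications")
    return top10_pubs / total_pubs

share = pp_top10(120, 900)  # about 0.133, above the 0.10 expectation
```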
Mission accomplished
ShanghaiRanking has largely fulfilled its original task "to discover the extant gap between Chinese universities and world-class universities" (Liu and Cheng, 2007, p. 175). Some Chinese universities have achieved top positions in the Leiden rankings (CWTS 2023). For ShanghaiRanking Consultancy, such successes are confirmations through external validation.
It is about time for further developments. The next level of skill may require some kind of cross-national research cooperation. A recognised global standard is still lacking. None of the well-known university rankings takes the quality and success of teaching adequately into account. The words 'student' or 'assistant' do not even appear in Shanghai's indicator definitions, let alone as self-determined individuals. Any pure research ranking is simply unsuitable for a comprehensive evaluation of universities.
The case for teaching quality
In line with the educational ideal (Stichweh 1994), teaching deserves no less research effort than bibliometrics. A universal unit of measurement such as quotations in bibliometrics is currently lacking. Unbiased assessments of teaching cannot be left completely to the institutions under review (CEPES 2006). Remote monitoring would be a step towards a standardised measurement of learning experiences (Dong 2023). Otherwise, teaching recommendations are left to current fashion and all kinds of sectarian ideas. Unfortunately, this appropriation is already taking place via administrations, at least in parts of Western academies (Gut 2024; see also Graf 2024).
Further steps
Currently, only certain issues with the data and the design of Arwu indicators have been addressed. Further reservations of this nature could be pointed out. For example, across all years, an astonishing 96 percent of all institutions made it back onto the top 100 list in the following year. In reality, this figure should be significantly lower if the stabilisers were used as defined.
Next, the scoring mechanism with standardisation and aggregation of indicators comes into focus. There are various better, demonstrably more functional options for its components. For this reason, Arwu's algorithmic core can be significantly improved. A follow-up article to address this more productive topic is being considered here.
The rankings from Shanghai will probably remain as static as ever by ignoring essential quality steps. Instead, the well-informed attitude towards such unsolicited services seems to be changing (University of Zurich 2024) and even once becoming a little less serious – Let's twist again (Checker 1961). The lyrics fit perfectly with this theme – enjoy.
Conclusion
The original task of Shanghai rankings was to help place Chinese universities in the league of the 100 best-known universities. This primary goal has been achieved by several universities and repeatedly confirmed by independent rankings.
Numerous errors have recently been discovered in the indicator data available to the public from ShanghaiRanking Consultancy. These can be categorised as inconsistencies with expectations, non-compliance with definitions, data omissions and data fabrication. This is not simply a matter of many individual mistakes; it also reveals a lack of functionality in the underlying design.
Taken together, this suggests unauthorised interventions at data level.
The errors have manifested themselves as unexpected patterns where only chance should be effective. The difference patterns exist independently of theory and are based entirely on Arwu data. They are 100 percent reproducible using just spreadsheets.
Dubious ranking decisions are an inevitable consequence of the underlying incorrect data. Chinese universities were hardly affected because they only appeared in the top 100 list a few years ago. The patterns mainly concern traditional Western universities. The fact that the indicators in question have not been diversified or further developed for more than 20 years is another reason for caution.
Two thirds of the indicators relate exclusively to cutting-edge research. As rankings of top research, Shanghai rankings were not suitable from the outset for comprehensively qualifying universities worldwide.
Our analysis has the potential to discredit this and other similar rankings. That was by no means intentional; we are merely stating facts. All data used have been published annually for more than two decades and are freely available.
As a result, the international use of Shanghai rankings has become generally inadvisable, except perhaps for basic research. Alternatives are available and are being developed step by step.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original authors and the source, provide a link to the Creative Commons licence, and indicate if changes were made. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .
References
APA - American Psychological Association (2020). Concise guide to APA style. The official style guide for students. (7th ed.). Washington, DC: American Psychological Association.
Bjørk, R. (2019). The age at which Nobel Prize research is conducted. Scientometrics, 119, 931–939.
CEPES - Institute for Higher Education Policy (2006). Berlin Principles on ranking of higher education institutions. https://www.ihep.org/wp-content/uploads/2014/05/uploads_docs_pubs_berlinprinciplesranking.pdf . Retrieved 21 June 2024.
Checker, Ch. (1961). Let's twist again. Philadelphia: Parkway. https://youtu.be/eh8eb_ACLl8?si=HtEYJtbfO6JfxNnb . Retrieved 22 Sept. 2024.
Clarivate (2023). Highly Cited Researchers™ Clarivate PLC, London. https://clarivate.com/highly-cited-researchers/ . Retrieved 21 June 2024.
CWTS - Centre for Science and Technology Studies (2023). CWTS Leiden ranking. www.leidenranking.com/ranking/2023/list . Retrieved 21 June 2024.
Docampo, D., Cram, L. (2014). On the internal dynamics of the Shanghai ranking. Scientometrics, 98, 1347–1366.
Docampo, D., Egret, D., & Cram, L. (2022). An anatomy of the academic ranking of world universities (Shanghai ranking). SN Social Sciences 2: 146, 1–17. DOI: https://doi.org/10.1007/s43545-022-00443-3 . Retrieved 21 June 2024.
Dong, Y. (2023). Teaching Quality Monitoring and Evaluation in Higher Education through a Big Data Analysis. International Journal of Emerging Technologies in Learning (iJET), 18(08), 61–78.
Fernández-Cano, A., Curiel-Marin, E., Torralbo-Rodríguez, M., & Vallejo-Ruiz, M. (2018). Questioning the Shanghai Ranking methodology as a tool for the evaluation of universities: an integrative review. Scientometrics, 116, 2069–2083.
Florian, R. V. (2007). Irreproducibility of the results of the Shanghai academic ranking of world universities. Scientometrics, 72, 25–32.
Freyer, L. (2014). Robust rankings. Review of multivariate assessments illustrated by the Shanghai rankings. Scientometrics, 100, 391–406. DOI: https://doi.org/10.1007/s11192-014-1313-8 . Retrieved 21 June 2024.
Graf, D. (2024). "Woke" genug? Universität Basel wegen "Gesinnungstest" in der Kritik. 20min. https://www.20min.ch/story/basel-woke-genug-universitaet-basel-wegen-gesinnungstest-in-der-kritik-103068728 . Retrieved 21 June 2024.
Gut, Ph. (2024). Uni Basel führt Gesinnungstest ein. Weltwoche. https://weltwoche.ch/story/uni-basel-fuehrt-gesinnungstest-ein/ . Retrieved 21 June 2024.
Lilienfeld, S. O., Wood, J. M., & Garb, H. N. (2000). The Scientific Status of Projective Techniques. Psychological Science in the Public Interest, 1(2), 27–66.
Liu, N. C., Cheng, Y. (2007). Academic ranking of World universities: Methodologies and problems. In: Sadlak, J., Liu, N. C. (Eds.), The World-Class University and Ranking: Aiming Beyond Status, 175–188. Bucharest: Cluj University Press.
OECD Environment, Health and Safety Publications (2021). Advisory Document on GLP Data Integrity. Series on Principles of Good Laboratory Practice (GLP) and Compliance Monitoring, No. 22. https://mobil.bfr.bund.de/cm/349/nr-22-oecd-advisory-document-of-the-working-party-on-good-laboratory-practice-on-glp-data-integrity.pdf . Retrieved 21 June 2024.
ShanghaiRanking (2023a). About Academic Ranking of World Universities. https://www.shanghairanking.com/about-arwu . Retrieved 28 July 2024.
ShanghaiRanking (2023b). Academic Ranking of World Universities Methodology 2023. http://shanghairanking.com/methodology/arwu/2023 . Retrieved 28 July 2024.
Sharvit, Y. (2022). Data-Oriented Programming. Shelter Island: Manning Publications Co.
Stichweh, R. (1994). The Unity of Teaching and Research. In: Poggi, S., Bossi, M. (Eds.), Romanticism in Science. Boston Studies in the Philosophy of Science, vol 152. Dordrecht: Springer.
Tofallis, Ch. (2012). A different approach to university rankings. Higher Education, 63(1), 1-18.
University of Zurich (2024). Research evaluation on rankings. https://www.openscience.uzh.ch/en/moreopenscience/researchevaluation/Rankings.html . Retrieved 21 June 2024.
Waltman, L. R., Calero Medina, C. M., Kosten, J., Noyons, E. C. M., Tijssen, R. J. W., Eck, N. J. P. van, … Wouters, P. (2012). The Leiden Ranking 2011/2012: Data collection, indicators, and interpretation. Centre for Science and Technology Studies, Leiden University. https://hdl.handle.net/1887/19353 . Retrieved 21 June 2024.
Walz, G. (Ed.). (2004). Lexikon der Statistik (1st ed.). Munich: Spektrum.
Wolfram Research (2007). MedianDeviation, Wolfram Language function (updated 2023). https://reference.wolfram.com/language/ref/MedianDeviation.html . Retrieved 21 June 2024.
Wolfram Research (2010). DistributionFitTest, Wolfram Language function (updated 2015). https://reference.wolfram.com/language/ref/DistributionFitTest.html . Retrieved 21 June 2024.
Wolfram, S. (2002). A new kind of science. Champaign: Wolfram Media, Inc.
Wolfram, S. (2022). Metamathematics: Foundations & Physicalization. Champaign: Wolfram Media, Inc.
Wolfram, S. (2023). An Elementary Introduction to the Wolfram Language, Third Edition. Champaign: Wolfram Media, Inc. https://www.wolfram.com/language/elementary-introduction/3rd-ed/ . Retrieved 01 July 2024.
November 18, 2024 / LF