Critical remarks on the Italian research assessment exercise

For nearly a decade, several national exercises have been implemented for assessing the Italian research performance, from the viewpoint of universities and other research institutions. The penultimate one – i.e., the VQR 2004-2010, which adopted a hybrid evaluation approach based on bibliometric analysis and peer review – suffered heavy criticism at a national and international


Introduction and literature review
In the last 10-20 years, a growing number of countries have implemented national exercises for assessing the performance of research institutions, with five key objectives (Schotten and El Aisati, 2014; Abramo and D'Angelo, 2015): 1. Guiding merit-based allocation of public funding; 2. Stimulating continuous improvement in research productivity, through comparative analysis of performance; 3. Identifying the strengths and weaknesses in disciplines and geographic areas, so as to support formulation of research policy and management strategies at a governmental and institutional level; 4. Providing convincing information to taxpayers on the effectiveness of research management and delivery of public benefits; 5. Reducing the information asymmetry between knowledge users (i.e., students, enterprises, and funding agencies) and suppliers (i.e., individual scientists).
Although the shares of overall public funding and the criteria for assigning them tend to vary from nation to nation, the number of countries that conduct regular comparative performance evaluations of universities and link the results to public financing seems to be gradually increasing (Hicks, 2012). In a nutshell, the VQR 2004-2010 was a hybrid type of evaluation exercise, based primarily on bibliometric analysis for the so-called bibliometric areas (i.e., hard sciences) and on peer review for the so-called non-bibliometric ones (i.e., social sciences and humanities). For details, see (ANVUR, 2011; Ancaiani et al., 2015).
Since its introduction, the VQR 2004-2010 received heavy criticism from part of the Italian scientific community. One of the targets of this criticism was the mechanism for determining the merit class of scientific papers, i.e., a bibliometric assessment combining (i) the number of citations obtained and (ii) a metric of the impact of the journal publishing the paper (ISI Impact Factor or similar). Some Italian scientists considered the criteria adopted by ANVUR as the product of Do-It-Yourself Bibliometrics (ROARS, 2016), for their anachronistic disregard of the basic rules of this discipline. Several Italian bibliometricians brought similar criticism to the attention of the international scientific community (Abramo and D'Angelo, 2015; Baccini and De Nicolao, 2016; Geuna and Piolatto, 2016).
After the "stormy" VQR 2004-2010, a new assessment exercise, denominated VQR 2011-2014, has recently been implemented and is still in progress. Despite the criticism of the evaluation criteria of the VQR 2004-2010, the architecture of the new exercise is rather similar to that of the previous one. The most noticeable difference is the new criterion for determining the merit class of the papers examined. Details on this and other differences are contained in the conference paper by Anfossi et al. (2015), later published in extended form in a Scientometrics special issue (Anfossi et al., 2016). As with the VQR 2004-2010, the VQR 2011-2014 has also received heavy criticism (ROARS, 2016).
The aim of this paper is to discuss the current research assessment exercise, collecting and organizing some critical arguments directed to the previous exercise and developing other arguments in detail. The discussion will address the conceptual/methodological aspects of the VQR 2011-2014, without investigating the practical implications of this exercise for the future of research in Italy.
The remainder of the paper is organized into three sections. Sect. 2 recalls the main features of the VQR 2011-2014 and provides a simplified description of the relevant bibliometric criteria, so as to prepare the ground for understanding the subsequent analysis. Sect. 3, which represents the core of this paper, discusses in detail five vulnerabilities of the VQR 2011-2014; the description is supported by several pedagogical examples. Sect. 4 summarizes and comments on the main findings of our critical analysis.

Description of the VQR 2011-2014
This section presents a "pedagogical" description of the current Italian assessment exercise (VQR 2011-2014), which is propaedeutic for understanding the contents of Sect. 3.
As mentioned in Sect. 1, the VQR 2011-2014 represents the "third act" of research assessment exercises in Italy. The purpose of this exercise is to evaluate the research activity carried out over the 2011-2014 period in public universities, legally recognized private universities and other research institutions under the responsibility of the MIUR. Apart from research institutions, objects of the evaluation are their macro-disciplinary areas and departments but not individual researchers.
The results may influence two areas of future action: (1) overall institutional evaluations will guide allocation of the merit-based share of the so-called Ordinary Finance Funds (FFO), i.e., the core government funding for Italian universities; (2) evaluation of the macro areas and departments can be used by research institutions to guide internal allocation of the acquired resources.
The evaluation of the whole institutions is determined by the weighted sum of a number of indicators: 75% based on a score for the quality of the research output and 25% derived from a composition of other indicators (capacity to attract resources, mobility of research staff, internationalization, Ph.D. programs, etc.).
Let us now focus the attention on the evaluation of the so-called research products, namely articles, books, book chapters, conference proceedings, critical reviews, commentaries, book translations, patents, prototypes, project plans, software, databases, exhibitions, works of art, compositions and thematic papers. The term "product" is used in the official ANVUR documents, indicating entities of different nature. Since our study will consider almost exclusively articles in scientific journals, conference proceedings and book chapters, this term will be hereafter replaced with the terms "paper", "article" or "publication".
ANVUR nominated 16 evaluation panels, i.e., the so-called Groups of Evaluation Experts (GEVs), including national and foreign experts, one for each research area composing the national academic system (details on the research areas and relevant GEVs are reported in Tab. A1, in the appendix).
The institutions subject to evaluation should submit a specific number of papers for each researcher with a permanent position, based on his/her academic rank and period of activity over the four years considered. Simplifying, the requirement for university staff is two papers per researcher, whereas that for other research institutions is three papers per researcher. The papers are then assigned to the appropriate GEVs, based on the researcher's identification of the most pertinent research areas (ANVUR, 2015a; 2015b).
The 16 research areas are divided into bibliometric and non-bibliometric ones, depending on their peculiarities (see Tab. A1, in the appendix). In the latter (i.e., typically social sciences and humanities) papers are evaluated exclusively through peer review, while in the former (i.e., typically hard sciences, such as engineering and life sciences) papers are evaluated using a mixed approach consisting of bibliometric analysis, for those indexed by Scopus and WoS, and peer review, for the other papers or even for the indexed papers when expressly requested by the institution.
Consistently with the Ministerial Decree of 27 June 2015 by MIUR (2015), the (bibliometric or peer-review) evaluation of the quality of each paper should result in one of five merit classes (A, B, C, D and E), as described in Tab. 1.
Tab. 1. Classes of merit and relevant score, in which papers evaluated are classified.

| Class | Score (S i ) | Description |
|---|---|---|
| A. Excellent | 1 | The paper places in the top 10% of the so-called "distribution of the international scientific production", for the specific area of interest and issue year. |
| B. Good | 0.7 | The paper places in the top 10-30% range of the same distribution. |
| C. Fair | 0.4 | The paper places in the top 30-50% range of the same distribution. |
| D. Acceptable | 0.1 | The paper places in the top 50-80% range of the same distribution. |
| E. Limited | 0 | The paper places in the bottom 20% of the same distribution or cannot be evaluated because it does not conform to the types of acceptable papers. |
The institutions are also subject to potential penalties: (i) in proven cases of plagiarism or fraud, (ii) for paper types not admitted by the GEV, or lack of relevant documentation, or produced outside the 2011-2014 period, and (iii) for failure to submit the requested number of papers.
We now focus on the paper evaluation in the bibliometric areas. Simplifying, each research institution submits the papers to be evaluated, specifying (1) the subject categories, as defined by WoS, or the science categories, as defined by Scopus Elsevier (for simplicity, both these groups of categories will be hereafter referred to as SC), among those associated with the publishing journals, and (2) the most pertinent GEV panels.
Two indicators are associated with each i-th paper: the citation count (C i ), i.e., the number of citations accumulated by the paper up to a given point in time, and a journal metric (J i ), i.e., an indicator of the impact of the publishing journal. The journal metrics considered include:
- AI: indicator obtained by weighting the citations received by the articles (in a specific time period) depending on the rank of the relevant (citing) journals, i.e., citations from highly ranked journals are weighted to make a larger contribution than those from poorly ranked journals. This journal metric can therefore be considered as normalized on the basis of the prestige of citing journals. AI is pre-calculated based on the WoS citation statistics.
- SCImago Journal Ranking (SJR): indicator similar to AI but pre-calculated according to the citation statistics by Scopus.
- Source Normalized Impact per Paper (SNIP): indicator similar to IPP but normalized based on the different citation propensity of the citing articles. SNIP is therefore a field-normalized indicator, pre-calculated using the Scopus citation statistics.
Having determined the reference database and the journal metric to be used, the evaluation procedure concerning each i-th paper is based on the following steps:
- Definition of an aggregate indicator, given by the linear combination of F J (J i ) and F C (C i ):
$$Y_i = w \cdot F_C(C_i) + (1 - w) \cdot F_J(J_i) \qquad (1)$$
where w ∈ [0, 1] is a weight used for giving more/less importance to the F C (C i ) and F J (J i ) contributions, in their aggregation by a weighted sum (see footnote 4).
The choice of the w value is left to the GEV. In general, ANVUR (2015a, 2015b, 2015c) recommends using relatively higher w values for older articles (e.g., those issued in 2011-2012), as they are likely to be mature enough in terms of citation impact. On the other hand, it recommends using relatively lower w values for more recent articles (such as those issued in 2014), in order to give more weight (i.e., 1 − w) to the journal metric, which is used as a proxy of the future impact of these articles.
- For each combination of SC and issue year, the distribution of the Y i values is supposed to represent the (so-called) "distribution of the international scientific production" (MIUR, 2015).
Consistently with what is reported in Tab. 1, papers can be classified into the five merit classes; the papers positioned in the remaining zones (highlighted in grey) can be assigned by GEVs to the classes that they consider appropriate or can be subject to an additional informed peer-review procedure. For details, see (ANVUR, 2011; Abramo et al., 2015; Ancaiani et al., 2015). On the other hand, the VQR 2011-2014 adopts a technique based on partitioning the F J (J i )-F C (C i ) plane into oblique stripes, as shown in Fig. 1(b).
Footnote 4: We remark that Y i and w are not explicitly defined in the official documents by ANVUR (2015a; 2015b; 2015c), which hint at partitioning of the F C -F J space into sub-regions delimited by parallel lines (i.e., with the same slope), defined by equations:
$$F_C(C_i) = A \cdot F_J(J_i) + B_n \qquad (n1)$$
where A is the (fixed) slope of the lines and B n is the relevant intercept. Comparing Eq. 1 with Eq. n1, we obtain:
$$A = -\frac{1 - w}{w}, \qquad B_n = \frac{Y_i}{w}$$
Thus, setting A corresponds to setting w uniquely, while setting a B n value corresponds to setting a Y i value uniquely.
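The steps of the bibliometric procedure can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not ANVUR's implementation: the percentile rank is taken as the empirical share of papers with a value not exceeding the one examined, the aggregation is the weighted sum with weight w on the citation component and 1 − w on the journal metric, and the class boundaries follow Tab. 1 with an arbitrary choice on how to treat ties at the thresholds. The function names and the reference data are hypothetical.

```python
from bisect import bisect_right

def percentile_rank(value, population):
    """Share of `population` with a value not exceeding `value` (empirical CDF)."""
    ordered = sorted(population)
    return bisect_right(ordered, value) / len(ordered)

def aggregate(f_c, f_j, w):
    """Weighted sum Y = w * F_C + (1 - w) * F_J, with w in [0, 1]."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * f_c + (1.0 - w) * f_j

def merit_class(f_y):
    """Map the percentile rank of Y to the classes of Tab. 1.

    Assumption: a paper exactly on a threshold falls in the lower class.
    """
    for threshold, label in ((0.9, "A"), (0.7, "B"), (0.5, "C"), (0.2, "D")):
        if f_y > threshold:
            return label
    return "E"

# Hypothetical reference set for one SC and issue year (not real data).
citations = [0, 1, 2, 3, 5, 8, 12, 20, 30, 50]
journal_metric = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
w = 0.4  # illustrative weight

# Y values of the reference set (paper k has citations[k] and journal_metric[k]).
y_all = [
    aggregate(percentile_rank(c, citations), percentile_rank(j, journal_metric), w)
    for c, j in zip(citations, journal_metric)
]

# Paper under evaluation: 20 citations, journal metric 3.0.
y_paper = aggregate(percentile_rank(20, citations),
                    percentile_rank(3.0, journal_metric), w)
f_y_paper = percentile_rank(y_paper, y_all)
paper_class = merit_class(f_y_paper)
print(round(y_paper, 2), f_y_paper, paper_class)
```

With these made-up numbers the paper has F C = 0.8 and F J = 0.6, and ends up in class C; changing w or the reference set changes the outcome.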

Critical analysis
This section is divided into five subsections, dealing with the major vulnerabilities of the bibliometric evaluation procedure of the VQR 2011-2014; a synthetic description of these vulnerabilities is reported in Tab. 3.
Tab. 3. Brief description of the major vulnerabilities of the bibliometric evaluation procedure in the VQR 2011-2014.

| Vulnerabilities | Brief description |
|---|---|
| 1. Evaluation of a small number of papers. | Even assuming that the (bibliometric and non-bibliometric) evaluation procedure is methodologically impeccable, the evaluation of just two/three papers per researcher represents a serious limitation for assessing the performance of research institutions. |
| 2. (Mis)use of journal metrics. | Using journal metrics (even when combined with other indicators) to evaluate the quality of individual papers is potentially misleading. |
| 3. Normalization/combination of indicators. | The normalization of C i and J i through the F C and F J percentile ranks, their subsequent aggregation into Y i , and the normalization of Y i through the F Y percentile rank are conceptually questionable operations. |
| 4. Decisional autonomy to GEVs. | Several operations of "calibration" of the metrics (e.g., setting w, choosing the more appropriate journal metric, etc.) are entrusted to GEVs; in the absence of solid guidelines, this freedom can be counterproductive. |
| 5. Compatibility between peer review and bibliometric analysis. | According to the VQR 2011-2014, the output of the bibliometric and peer-review evaluation should be mutually compatible. This assumption does not seem to be supported by adequate empirical evidence. |

Evaluation of a small number of papers
As anticipated, the VQR 2011-2014 evaluates a relatively small number of papers per researcher, i.e., two or three. This limitation, which generally characterizes peer-review based exercises due to the considerable effort required to read and (manually) evaluate the examined papers, may represent a critical concern for the reliability of results; Abramo et al. (2014) justly state that it would be reasonable to extend the evaluation to the totality of the papers produced, at least for bibliometric areas.
In light of the previous considerations, a question arises: Which research-performance features can the VQR 2011-2014 depict? Proceeding by elimination, we believe that this exercise does not allow productivity to be depicted, due to the relatively low number of papers evaluated. Also, it does not seem appropriate for assessing the average quality/impact of the research, since it ignores a significant portion of the papers produced during the evaluation period. It does not even seem appropriate for assessing research excellence, defined as the ability to produce high-level research with a certain regularity (Franceschini and Maisano, 2011); in fact, the production of two high quality/impact papers in four years does not seem a sufficient condition to prove the excellence of a generic researcher (one swallow does not make a summer).
Let us present a simple numerical example to clarify the last point: consider a generic mid-level researcher (X) with a scientific production in line with the so-called "distribution of the international scientific production". We hypothesize that this researcher is able to produce about three papers per year and, therefore, about 12 papers in the 2011-2014 time window. Consistently with the information contained in Tab. 1, only 10% of the papers will (on average) achieve the highest class (A), while only 30% will (on average) achieve class A or B. The probability that this (mid-level) researcher has at least two papers of class A or B will be:
$$Pr = 1 - \sum_{k=0}^{1} \binom{12}{k} p^k (1 - p)^{12 - k} = 1 - 0.7^{12} - 12 \cdot 0.3 \cdot 0.7^{11} \approx 91.5\% \qquad (2)$$
with p = 0.3.
Let us consider a second excellent researcher (Y), who is able to produce about 12 papers (in the same time window), all of which are of class A or B. Researcher Y will obviously have at least two papers of class A or B (i.e., Pr = 100%).
The previous example shows that, in spite of the obvious superiority of the excellent researcher (Y), even the mid-level one (X) has a very high probability (91.5%) of having at least two papers of class A or B (see also the chart in Fig. 2). It is therefore very difficult to discriminate between these two researchers when considering two papers only. As an alternative example, it is trivial to demonstrate that it would be impossible to discriminate between a researcher with two-and-only-two papers of class A and a researcher with a plethora of papers of class A.
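The probabilities in this example can be checked numerically. The sketch below assumes that the 12 papers are independent and that each falls in class A or B with the same probability p (a binomial model, which reproduces the 91.5% quoted above); the function name is hypothetical.

```python
from math import comb

def prob_at_least(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p), via the complement."""
    return 1.0 - sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min))

# Mid-level researcher X: 12 papers, each in class A or B with p = 0.30.
pr_x = prob_at_least(2, 12, 0.30)
# Excellent researcher Y: 12 papers, all in class A or B (p = 1).
pr_y = prob_at_least(2, 12, 1.00)

print(f"{pr_x:.3f}")  # 0.915
print(f"{pr_y:.3f}")  # 1.000
```

Sweeping p from 0 to 1 with the same function gives the kind of curve described in the caption of Fig. 2.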

Fig. 2. Graph showing the probability (Pr) of a researcher to have at least two articles of class A or B, assuming that he/she has produced 12 papers, each with a probability (p) to be in these classes; Pr was calculated using the model in Eq. 2. For example, the (mid-level) researcher X has a probability p = 30% to produce articles of class A or B, while the (excellent) researcher Y exclusively produces papers of class A or B (p = 100%). Despite this large gap, the Pr values related to the two researchers are not much different (i.e., 91% against 100%).
Since the assessment of entire research institutions is performed by aggregating the contributions from individual researchers, our considerations on the poor discrimination power in the identification of excellent researchers can be extended to the identification of excellent research institutions. Returning to the initial example, let us assume that there are two research institutions: the first mostly consists of mid-level researchers (like researcher X), while the second mostly consists of excellent researchers (like researcher Y). The big gap between these two populations would not necessarily be captured by an evaluation system based on the submission of only two papers per researcher. Thus, we believe that it would be unwise to use the results of the VQR 2011-2014 exercise to estimate the level of excellence of research institutions.
Having said that, a new question arises: Is there any reasonable use of the results of the proposed exercise? With some effort of imagination, it seems that the results of this exercise can only depict the level of research decency, meaning the ability to produce, in the relatively long time period of four years, a low number of papers with relatively high impact/quality. It is not unrealistic to assume that, for a research institution in which researchers are (on average) active, it would not be so difficult to "saturate" the expected scores for the papers submitted, i.e., most researchers would be able to submit papers classified in relatively high merit classes (e.g., A or B, as also illustrated in the previous examples). Inverting the reasoning, this exercise could help identify institutions with a relatively high incidence of "lazy" researchers, i.e., researchers unable to produce at least two/three papers with relatively high impact in four years. Let us clarify this through a metaphor: if the students of a middle-school class were evaluated through a very permissive test, most of them would be likely to pass it with a high score, except for the least prepared.
In conclusion, the authors believe that this type of evaluation could be effective for identifying the less virtuous research institutions but ineffective for identifying the excellent ones.

(Mis)use of journal metrics
As previously described, the bibliometric classification of a generic i-th paper is based on the combination of the C i and J i indicators; this subsection focuses the attention on the latter one.
According to ANVUR, journal metrics can be especially useful to support the evaluation of relatively recent papers, which are not so mature in terms of citation impact; following this reasoning, when evaluating these papers, ANVUR suggests decreasing w, in order to give more weight to J i (which is implicitly used as a proxy of the future citation impact of the papers) with respect to C i (ANVUR 2015a; 2015b; 2015c). For many years now, a large number of contributions in the scientific literature have documented the widespread misuse of journal metrics for assessing individual articles (Seglen, 1997; Lozano et al., 2012; IEEE, 2013; Marx and Bornmann, 2013; Ware and Mabe, 2015); according to Van Raan, this would be a "mortal sin" (Levine, 2011). The reason, almost universally acknowledged among bibliometricians, is that the variability in the number of citations received by articles published by the same journal is generally high; as a consequence, the use of central tendency indicators, such as journal metrics, is inappropriate for estimating the citation impact of individual papers. To use a metaphor, it would be like predicting the future height of a specific individual using the average height of the population (human height being relatively dispersed).
It matters little that the results of the national exercise will not be used to evaluate individual researchers but entire research institutions or perhaps portions of them (Ancaiani et al., 2016): the use of journal metrics remains incorrect, as it is directed to the evaluation of individual articles. It can be also said that the combination of a correct metric (C i ) with a distorted one (J i ) can only produce a new distorted metric (Y i in the case of the VQR 2011-2014).
Also, giving more merit to papers published in journals with relatively high J i values is questionable for two reasons:
- Journals with higher J i values are not necessarily more stringent and rigorous in the selection of the papers to be published, also due to the diffusion of techniques for manipulating journal metrics (Martin, 2016);
- Papers published in journals with higher J i values tend to have a higher propensity (on average) to be cited than papers (of similar quality) published in journals with lower J i values, due to a sort of "showcase effect" (Didegah and Thelwall, 2013; Franceschini and Maisano, 2014). It is therefore debatable that such papers should receive a further advantage.
Although we are aware of the difficulties in estimating the future citation impact of recent papers, we believe that the use of journal metrics as predictors represents an illusory and distorting solution. This is confirmed by several authoritative scientific contributions (Lett, 2013;Bohannon, 2016).
A less debatable solution could be complementing C i with the so-called altmetrics, i.e., alternative metrics related to individual papers, such as the count of the number of views, downloads, blogs, media coverage, etc. (Bornmann, 2014; Costas et al., 2015); however, it is still necessary to investigate the potential of altmetrics and their benefits and disadvantages for measuring impact.

Normalization/combination of indicators
The bibliometric evaluation of individual papers is based on the normalization of the two indicators C i and J i , through the percentile ranks F C and F J , and their subsequent aggregation into Y i , through a weighted sum. Even assuming that combining C i and J i is meaningful (see the criticism in Sect. 3.2), this section shows that the proposed normalization and consequent combination is conceptually misleading. The remainder of this section is divided into five sub-sections: Sect. 3.3.1 recalls some basic properties of the scales of measurement, which are functional to the understanding of the subsequent criticism, Sect. 3.3.2 criticizes the normalization of J i and C i , Sect. 3.3.3 criticizes the combination of F C and F J , Sect. 3.3.4 criticizes the score assignment to merit classes, and Sect. 3.3.5 summarizes the criticism in Sects. 3.3.2 to 3.3.4.

Basic properties of the scales of measurement
A largely accepted classification of the scales of measurement was proposed by Stevens (1946). In this proposal, measurements/indicators can be classified into four different types of scales: nominal, ordinal, interval and ratio (see Tab. 4).
It will be convenient to illustrate this scheme with a variable X and two objects, say A and B, whose scores on X are x A and x B , respectively.
1. A nominal scale merely distinguishes between classes (equivalence relationship). That is, with respect to A and B one can only say x A = x B or x A ≠ x B .
2. An ordinal scale induces an ordering of the objects (order relationship). In addition to distinguishing between x A = x B and x A ≠ x B , one may say that x A > x B or x A < x B .
3. An interval scale assigns a meaningful measure of the difference between two objects (distance relationship). One may say not only that x A > x B , but also that A is x A − x B units different from B.
4. A ratio scale is an interval scale with a meaningful zero point (which allows ratio relationship). If x A > x B then one may say that A is x A / x B times superior to B.
Tab. 4. Classification scheme of measurements/indicators depending on their scale types (Stevens, 1946; Roberts, 1979).
From the viewpoint of the scale properties, the above types of measurement scales are ordered from "less powerful" to "more powerful". In particular, the more powerful scales (interval and ratio) provide more information and are generally preferred for measurement purposes. It is often a goal of measurement to obtain scales that are as powerful as possible but, unfortunately, this is not always so straightforward (Franceschini et al., 2007).
As a general rule, numbers should be analysed on the basis of the properties of the scale with which they are gathered (Roberts, 1979). Consequently, one may obtain results that do not make sense by applying arithmetic operations to measurements/indicators with scales in which these operations are inadmissible (see the second column of Tab. 4).
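A small numerical illustration may help here. The sketch below uses hypothetical ordinal codes (not VQR data) and shows that comparing the arithmetic means of two groups of ordinal ratings leads to opposite conclusions under two recodings that both preserve the order of the categories, i.e., under two admissible transformations of an ordinal scale.

```python
from statistics import mean

# Hypothetical ordinal ratings (1 = lowest, 5 = highest) for two groups.
group_a = [1, 5]
group_b = [3, 3]

# Two recodings that both preserve the order 1 < 3 < 5 and are therefore
# admissible transformations of an ordinal scale.
recoding_1 = {1: 1, 3: 3, 5: 10}
recoding_2 = {1: 1, 3: 8, 5: 10}

def recoded_mean(group, recoding):
    """Mean of a group after mapping each ordinal code through `recoding`."""
    return mean(recoding[x] for x in group)

a_1, b_1 = recoded_mean(group_a, recoding_1), recoded_mean(group_b, recoding_1)
a_2, b_2 = recoded_mean(group_a, recoding_2), recoded_mean(group_b, recoding_2)

print(a_1 > b_1)  # True: group A has the higher mean under the first recoding
print(a_2 > b_2)  # False: group B has the higher mean under the second recoding
```

Since the conclusion depends on an arbitrary choice of numerical codes, the mean is not a meaningful statistic for ordinal data; the same reasoning underlies the criticism of sums of percentile ranks developed in the following sub-sections.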

Normalization using percentile ranks
In light of the classification in Sect. 3.3.1, C i and J i are defined on ratio scales since they both have a meaningful zero, corresponding to the absence of the measured manifestation (i.e., the citations obtained by an article or an entire journal), and an objective and precise unit. Thus, they allow relationships of equivalence, order, distance and ratio among the objects represented; the only permissible scale transformation, which preserves the above relationships among the objects, is that of similarity.
The normalizations through the percentile ranks F C , F J and F Y can be interpreted as not necessarily linear, monotonically increasing transformations, which turn the realizations of the variables of interest (C i , J i and Y i ) into the cumulative probabilities (or percentile ranks) of the corresponding distributions (see the example in Fig. 3). Of course, depending on the distributions of interest, the percentile rank functions (F C , F J and F Y ) will be different. Only in the special, and very unlikely, case in which C i , J i and Y i were uniformly distributed would these monotonically increasing functions degenerate into similarity functions (φ(x) = a·x, with a > 0). In general, the F C , F J and F Y transformations would distort both the interval and ratio relationships among the initial objects (C i , J i and Y i values), preserving only the equivalence and order relationships (Roberts, 1979; Kreifeldt and Nah, 1995; Thompson, 1993; Bornmann et al., 2013).
Let us provide a practical example, considering three fictitious papers (P α , P β and P γ ) published by two journals in the same SC and issue year. The three papers respectively received C α = 5, C β = 10 and C γ = 15 citations. Since C i is defined on a ratio scale and is typically used to evaluate the citation impact of a paper, the following statements are meaningful (see the first two columns of Tab. 6):
1. Equivalence relationship: all the three papers have different citation impact;
2. Order relationship: the citation impact of P γ is higher than that of P β , which is in turn higher than that of P α ;
3. Distance relationship: the difference (in terms of citation impact) between P γ and P β is equal to that between P β and P α ;
4. Ratio relationship: the citation impact of P β is twice that of P α , while that of P γ is three times that of P α .
Let us now consider the empirical distribution of the C i values reported in Tab. 5, which is also represented graphically in Fig. 3(a); the percentile ranks related to the C α , C β and C γ values are F C (C α = 5) = 70.3%, F C (C β = 10) = 81.3% and F C (C γ = 15) = 82.8% (see Tab. 5).
Tab. 5. Absolute/relative frequencies and percentile ranks related to the C i values of 64 fictitious papers, published by journals in the same SC and issue year. f a is the absolute frequency related to a certain C i value; f r is the relative frequency related to a certain C i value; F C is the cumulative probability (or percentile rank) related to a certain C i value.
Returning to the four previous statements about the relationship among objects, it can be seen that the application of the F C transformation does not alter the relationships for the less "powerful" scales, i.e. the categorical scales (nominal and ordinal), but it may alter the relationships among objects for the cardinal scales (i.e., interval and ratio) (see Tab. 6). In other words, the application of the F C transformation downgrades the initial (ratio) scale of C i to an ordinal scale, preserving the relationships of equivalence and order but distorting those of distance and ratio.
The above considerations can be extended to J i and Y i and the respective transformation/normalization through the F J and F Y functions.
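The effect of the F C transformation on the four relationships can be verified directly from the numbers quoted above. The following sketch uses only the three citation counts (5, 10 and 15) and the corresponding percentile ranks (70.3%, 81.3% and 82.8%); the labels p1, p2 and p3 and the helper function are introduced here for convenience only.

```python
# Citation counts of the three fictitious papers and the percentile ranks
# quoted in the text (in %); p1, p2, p3 are just convenient labels.
citations = {"p1": 5, "p2": 10, "p3": 15}
f_c = {"p1": 70.3, "p2": 81.3, "p3": 82.8}

def describe(values):
    """Return the order, consecutive differences and ratios to the first paper."""
    v1, v2, v3 = values["p1"], values["p2"], values["p3"]
    return {
        "order": sorted(values, key=values.get),
        "differences": (v2 - v1, v3 - v2),
        "ratios": (v2 / v1, v3 / v1),
    }

before = describe(citations)
after = describe(f_c)

# Order (and hence equivalence) is preserved by the transformation.
print(before["order"] == after["order"])
# Differences: equal before, clearly unequal after.
print(before["differences"], [round(d, 1) for d in after["differences"]])
# Ratios: 2 and 3 before, both close to 1 after.
print(before["ratios"], [round(r, 2) for r in after["ratios"]])
```

The two equal differences of 5 citations become about 11.0 and 1.5 percentage points, and the ratios of 2 and 3 shrink to about 1.16 and 1.18, which is the distortion of the distance and ratio relationships discussed above.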

Tab. 6. Example of statements preserved and distorted, after having applied the F C transformation function in Tab. 5 to C α = 5, C β = 10 and C γ = 15.
Columns: Relationship | Initial statement | After the F C transformation | Statement preserved?

Combination of F C and F J
Being defined in the same [0, 100%] range, the normalized indicators F C (C i ) and F J (J i ) may seem comparable. F C (C i ) and F J (J i ) are then combined into the synthetic indicator Y i , through a polynomial function. Anfossi et al. (2016) state that the aggregation function could be a generic polynomial function, even of order higher than one, satisfying the basic requirement of Pareto dominance; then, for the purpose of simplicity, they suggest using linear functions as modelled in Eq. 1. It seems that this hint has been followed by most of the GEVs (ANVUR, 2015b; 2015c).
Having said that, the proposed combination of F C and F J is questionable for (at least) four reasons: 1. Although the authors share the opinion of Anfossi et al. (2016) that Pareto dominance would be a desirable property, they point out that a convex combination of F C and F J , like the one in Eq. 1, cannot satisfy Pareto dominance. In fact, a pair (F C (C 1 ), F J (J 1 )) is said to Pareto dominate another pair (F C (C 2 ), F J (J 2 )) whenever both the conditions F C (C 1 ) ≥ F C (C 2 ) and F J (J 1 ) ≥ F J (J 2 ) hold. Since Y i is a linear combination of F C and F J , there are situations in which Y 1 ≥ Y 2 although the first pair does not Pareto dominate the second. Incidentally, the classification adopted in the VQR 2004-2010 satisfies the requirement of Pareto dominance (see Fig. 1(a)): all publications in the merit class A are Pareto dominant to those in all lower classes (and similarly the merit class B is Pareto dominant to C and D, etc.).
2. The aggregation model in Eq. 1 is based on the weighted sum of objects (i.e., the F C and F J percentile ranks), which are defined on ordinal scales (see Sect. 3.3.2). This aggregation is therefore inadmissible (cf. Tab. 4) and conceptually misleading (Roberts, 1979); to confirm this, the scientific literature includes several contributions indicating that percentile ranks cannot be added, such as (Thompson, 1993; Kreifeldt and Nah, 1995).
3. The proposed aggregation presupposes the existence of questionable equivalence classes for the papers examined, depending on the J i and C i values.
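The first point in the list above can be checked numerically. The following Python sketch is purely illustrative: the two (F C , F J ) pairs are invented, and the form Y = w·F C + (1 − w)·F J is an assumption about Eq. 1 (in the actual model the roles of w and 1 − w may be swapped).

```python
def pareto_dominates(a, b):
    """True if pair a = (F_C, F_J) is at least as good as pair b on both coordinates."""
    return a[0] >= b[0] and a[1] >= b[1]


def weighted_sum(pair, w):
    # Assumed form of the linear aggregation: Y = w * F_C + (1 - w) * F_J.
    f_c, f_j = pair
    return w * f_c + (1 - w) * f_j


# Hypothetical percentile ranks (F_C, F_J) of two papers.
paper_1 = (0.90, 0.20)  # highly cited, low-ranked journal
paper_2 = (0.40, 0.50)  # fewer citations, better-ranked journal

w = 0.4
y1 = weighted_sum(paper_1, w)
y2 = weighted_sum(paper_2, w)

# Paper 1 gets the higher composite value although it does not
# Pareto-dominate paper 2 (its F_J is lower).
print(y1 > y2, pareto_dominates(paper_1, paper_2))
```

With these invented values, paper 1 is ranked above paper 2 by the weighted sum even though neither paper Pareto-dominates the other.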
Let us develop the last point with an example. Considering the C i and J i values related to 64 fictitious papers in a certain SC and issue year, we can represent them in the J i -C i plane with the relevant distributions (see Fig. 4).
By applying the empirical transformations F C (C i ) and F J (J i ) to the initial data (see the relevant columns in Tab. A2, in the appendix), the initial J i -C i plane is "deformed" into the new F J -F C plane in Fig. 5(b) (ROARS, 2016). Comparing the graphs in Fig. 5(a) and Fig. 5(b), we note that these transformations may cause an uncontrollable variation in the point positioning (numeric labels refer to the paper ID numbers reported in Tab. A2, in the appendix).
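The "deformation" introduced by the empirical transformations can be seen on a very small data set. The sketch below assumes that the percentile rank is the empirical cumulative probability, i.e., the share of papers with a value not larger than the one considered; the exact convention (e.g., the treatment of ties) and the four citation counts are assumptions made for illustration only.

```python
def empirical_cdf(values):
    """Empirical cumulative probability of each value within its own sample."""
    n = len(values)
    return [sum(1 for v in values if v <= x) / n for x in values]


# Hypothetical citation counts of four papers in the same SC and issue year.
citations = [0, 1, 2, 50]
f_c = empirical_cdf(citations)

# The gaps 1, 1 and 48 on the original scale all become gaps of 0.25:
# only the order of the papers survives the transformation.
print(f_c)
```

The very different distances between the original values are mapped into identical distances between percentile ranks, which is why the transformed values can only be read on an ordinal scale.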
The loci of the points with the same Y i value, i.e., the so-called equivalence classes or iso-Y i contour lines, can be represented on the F J -F C plane. When adopting a linear aggregation model (like the one in Eq. 1), the iso-Y i contour lines are straight lines (see also Fig. 1(b)). By way of example, Fig. 6 shows four lines for the Y i values corresponding to F Y ≈ 20%, 50%, 70% and 90% respectively; in this case, w was set to 0.4.

Fig. 6. Iso-Y i contour lines for the F J -F C plane, relating to the data shown in Fig. 5(b). The original data are reported in Tab. A2 (in the appendix). Y i has been calculated setting w = 0.4.

From the perspective of the Y i indicator, two (or more) points/papers on the same oblique line (Fig. 6) are considered equivalent. Although this may sound reasonable, it is a source of possible distortions. In fact, referring to the initial scales of C i and J i , the iso-Y i contour lines have an unpredictable form, as they are influenced by the empirical distributions of the C i and J i values (see the representation in Fig. 7). For example, assuming that the distribution of the C i values changes into that of the C i ' values reported in Tab. A3 (in the appendix), while that of the J i values remains unchanged, the new contour lines would be deformed significantly with respect to the initial ones (see Fig. 8(a) and (b)).
Similar uncontrolled variations can result when introducing small changes in w; for example, Fig. 8(c) represents new iso-Y i contour lines, when using w' = 0.6 instead of w = 0.4.
In light of the above observations, a new question arises: what is the rationale for considering two points lying on the same line as equivalent? We believe that there is no convincing conceptual or empirical reason that can justify this kind of equivalence. The "instability" related to the equivalence classes is simply a negative consequence of the above-described improper aggregation of C i and J i .
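The instability discussed above can be reproduced with a toy data set. In the following Python sketch, the five papers, their values, and the form Y = w·F C + (1 − w)·F J are all assumptions made for illustration (they are not the data in Tab. A2-A4). Papers P and Q keep the same C i and J i throughout; only the citations of the other three papers, or the weight w, are changed.

```python
def empirical_cdf(values):
    """Empirical cumulative probability of each value within its own sample."""
    n = len(values)
    return [sum(1 for v in values if v <= x) / n for x in values]


def composite(citations, journal_metrics, w):
    # Assumed linear aggregation: Y = w * F_C + (1 - w) * F_J.
    f_c = empirical_cdf(citations)
    f_j = empirical_cdf(journal_metrics)
    return [w * c + (1 - w) * j for c, j in zip(f_c, f_j)]


# Hypothetical papers, in the order P, Q, R, S, T.
journal = [1.0, 3.0, 2.0, 4.0, 0.5]
base = [10, 5, 1, 2, 7]     # initial citation counts
shifted = [10, 5, 6, 7, 8]  # only R, S and T have changed


def p_beats_q(citations, w):
    """True if paper P obtains a higher composite value than paper Q."""
    y = composite(citations, journal, w)
    return y[0] > y[1]


print(p_beats_q(base, 0.4))     # initial configuration
print(p_beats_q(shifted, 0.4))  # other papers' citations changed
print(p_beats_q(base, 0.6))     # only the weight changed
```

In this toy case Q precedes P in the initial configuration, while P precedes Q both when the citations of the other papers change and when w moves from 0.4 to 0.6, although the C i and J i values of P and Q never change.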
It can also be noticed that the substitution rate between C i and J i (defined as the rate at which the C i value can be increased/decreased in exchange for a decrease/increase in the J i value, maintaining the same Y i value) is not constant. The example in Fig. 7 shows that, for a generic iso-Y i line (e.g., that on the borderline between the A and B merit classes), identical variations in J i (e.g., ΔJ i ' = ΔJ i '' = 0.5) may correspond to very different variations in C i (i.e., ΔC i ' ≈ 4.5 versus ΔC i '' ≈ 0.5) and vice versa. In other words, the substitution rate is not constant over the

Score assignment to merit classes
Having associated each paper with a Y i value, the corresponding percentile rank F Y (Y i ) ∈ [0, 100%] can be determined. Consistently with the description in Sect. 3.3.2, this operation may distort the distance relationships (if any) among the initial Y i values, generating a new indicator F Y (Y i ) that only preserves the equivalence and order relationships. Next, each paper receives a score (S i ) depending on the merit class related to the relevant F Y (Y i ) values; this operation can be represented graphically through the function in Fig. 9, which is weakly monotonically increasing. This transformation further degrades the scale of F Y (Y i ) to another ordinal scale (of S i ) with much lower resolution, due to the limited number of levels (i.e., five only); e.g., papers with different Y i and therefore different F Y (Y i ) values can be mapped into the same merit class, obtaining the same S i score. This other mapping function is questionable for three reasons: 1. The order relationships among papers with different Y i values but in the same merit class are partly lost; 2. The S i score assigned to each class is purely conventional and therefore arbitrary; 3. The scores related to papers from the same institution are then summed up; this operation is not permissible for indicators defined on ordinal scales (cf. Sect. 2). In other words, the initial ordinal scale of S i is unduly promoted to a cardinal one (interval or ratio scale).
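The loss of resolution described above can be sketched in a few lines of Python. The class borders are taken to be the F Y values of 20%, 50%, 70% and 90% mentioned in connection with Fig. 6; the class labels, the scores attached to them, and the choice of including the lower border in each class are assumptions made for illustration, the actual conventions being those in Tab. 1.

```python
from bisect import bisect_right

# Assumed class borders on the F_Y scale and hypothetical class labels/scores.
BORDERS = [0.20, 0.50, 0.70, 0.90]
LABELS = ["E", "D", "C", "B", "A"]
SCORES = [0.0, 0.1, 0.4, 0.7, 1.0]


def merit_class(f_y):
    """Map a percentile rank F_Y in [0, 1] to a (label, score) pair."""
    k = bisect_right(BORDERS, f_y)  # number of borders not exceeding f_y
    return LABELS[k], SCORES[k]


# Two clearly different percentile ranks collapse into the same score ...
same_1 = merit_class(0.71)
same_2 = merit_class(0.89)
# ... while two almost identical ones end up in different classes.
just_above = merit_class(0.90)

print(same_1, same_2, just_above)
```

Under these assumptions, papers with F Y = 0.71 and F Y = 0.89 receive the same score, whereas papers with F Y = 0.89 and F Y = 0.90 are separated by a full class; any subsequent sum of the scores inherits both effects.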
Despite the criticism reported in this sub-section, we understand that the questionable discretization of the Y i percentile ranks into corresponding classes is an operation that ANVUR was forced to

Critical issues, each followed by a short description:
1. Normalization of C i and J i using F C and F J . These operations downgrade the initial ratio scales of C i and J i to ordinal scales (i.e., F C and F J ).
2. Aggregation of F C (C i ) and F J (J i ) through a weighted sum. This operation, which is prohibited for indicators defined on nominal or ordinal scales, has some distorting effects: unpredictable equivalence classes iso-Y i ; unpredictable and variable substitution rate between C i and J i .
3. Transformation of Y i into the percentile rank F Y (Y i ). This operation may distort the distance relationships (if any) among the initial Y i values.
4. Score assignment to the (initial) merit classes. This transformation deteriorates the resolution of the F Y indicator. The aggregation of the S i scores by a sum is incorrect, as these scores are defined on an ordinal scale.
The authors are aware that defining adequate indicators is a difficult task (Franceschini et al., 2007); nevertheless, they believe that the bibliometric evaluation process of the VQR 2011-2014 contains too many questionable operations. Also, even if one (erroneously) decided to combine C i and J i , we believe that this could be done while avoiding dubious transformations/normalizations that alter the scales of the initial data.

Decisional autonomy to GEVs
A presumed improvement of the VQR 2011-2014 with respect to the previous exercise is the increased decisional autonomy given to the panels of experts (GEVs) in defining some parameters/indicators related to the bibliometric evaluation procedure (Benedetto and Setti, 2016;Anfossi et al., 2016). In our opinion, in the absence of solid and reasonable guidelines, this autonomy may sound like "abandoning GEVs to their fate". Our concerns stem from two different reasons: first, it is (implicitly) assumed that GEV members necessarily have specialized competences in bibliometric evaluation in their research areas (Abramo and D'Angelo, 2015); second, several operations of selection and "calibration" of the metrics may be tricky, even assuming that GEV members really have those competences.
Although much will depend on how GEVs will work and the assistance that they will receive from ANVUR, we believe that three potentially tricky operations are:
1. Selection of appropriate journal metrics (see Tab. 2), to be combined with C i for the bibliometric evaluation of the papers in a certain SC and issue year. The GEVs' freedom to choose between different types of journal metrics seems pointless: given that the C i values are neither field-normalized nor normalized according to the scientific reputation of the citing papers (Franceschini and Maisano, 2014), it is "asymmetric" to combine them with journal metrics implementing a field normalization (such as SNIP) or a normalization based on the reputation of the authors (such as SJR or AF). For this reason, we are quite surprised to read some statements by presumed experts stating that a certain journal metric is "totally inadequate", while another one is appropriate for a bibliometric evaluation of the papers presented in a certain area 6 (ANVUR, 2015b, page 13).
2. Choice of the weight (w) to be used when aggregating the F C and F J values, through the model in Eq. 1. Given the conceptual problems highlighted in Sect. 3.3, choosing the "right" value of w seems rather adventurous. One of the obstacles is the uncontrollability of the substitution rate between the C i and J i indicators, as discussed in Sect. 3.3.3. ANVUR does not provide precise guidelines for choosing the values of w, probably because it would be very difficult to formulate them. The only indication 7 is that, for the more recent papers, it would be appropriate to give greater weight to J i than C i .
3. In cases of wide discrepancy between the C i and J i values (see the grey areas in Fig. 1(b)), GEVs may decide to complement the automatic classification of papers with an additional informed-peer-review assessment 8 (ANVUR, 2015b; Anfossi et al., 2016). This probably makes sense for papers with low C i and high J i values, since they can be seen as papers of little impact with the sole merit of being published in journals generally containing papers of high impact. Conversely, it does not seem reasonable that papers with high C i and low J i values are re-assessed, as they have the merit of having achieved a relatively high impact, although being part of off-peak journals. Moreover, the right to "amend" the result of the bibliometric classification through the informed-peer-review assessment seems a further way to reduce the repeatability and increase the subjectivity of the whole evaluation process.

6 A document describing the evaluation criteria that GEVs are going to use for the "Mathematics and Computer Science" area (ANVUR, 2015b, page 13) reports (translated from Italian): We excluded the IF and IPP because it was verified that the indications provided by pure impact indicators, i.e., non field-normalized (SNIP) or calculated without a selection of the journals in the area of interest, are totally inadequate to measure the impact of the journals in that area.

7 The choice of the slope of the lines should be left to the panels, since it imposes the relative weight of citations and journal metrics. […] It is therefore possible to assign more relevance to one of the two dimensions depending on, say, the year of publication or the citation habits of specific disciplines (Anfossi et al., 2016, page 676).

8 The basic concept of informed peer review is that a judicious application of specific bibliometric indicators and other data concerning the papers examined (e.g., abstract, brief description, any awards/reviews received by these papers, ORCID of the co-authors, etc.) may inform the process of peer review, depending on the exact goal and context of the assessment. According to Moed (2007), both metrics and peer review have their strengths and limits. The challenge is to combine the two methodologies in such a way that the strengths of the first compensate for the limitations of the second and vice versa. However, it matters a lot exactly which forms of peer review and which specific dimensions of peer review are being related to exactly which bibliometric indicators. It is also important to define exactly how these bibliometric indicators are being measured and on the basis of which data sets. Bibliometric measures ought not by definition to be seen as the objective benchmark against which peer review is to be measured (Wouters et al., 2015, page 65).

Compatibility between peer review and bibliometric analysis
A very delicate point of the VQR 2011-2014, which has been inherited from the VQR 2004-2010, is the presumed "interchangeability" between the assessment through bibliometric indicators and that through peer review, for bibliometric areas. As described in Sect. 2, researchers in these areas may choose the type of evaluation for each of the papers submitted. Moreover, some papers subject to bibliometric assessment may be evaluated through an additional informed-peer-review assessment (ANVUR, 2015a;2015b;2015c).
According to some bibliometricians, the problem of the correlation between the results of the bibliometric evaluation and those of the peer review process is controversial and, to date, the alignment between the results of peer review and bibliometric analysis is still an open question. ANVUR declares the importance of this presumed correlation for the effectiveness of hybrid research evaluation exercises like the VQR, and claims that the previous VQR 2004-2010 met this requirement (Bertocchi et al., 2016). On the other hand, Baccini and De Nicolao (2016a;2016b) argue that the results related to the VQR 2004-2010 show a rather poor correlation, except in a specific area (i.e., Economics); they also argue that, in this specific case, results of the peer review were influenced by those of bibliometric evaluation, leading to abnormally high correlation.
3. Misleading normalization and composition of C i and J i . These operations may cause additional distortion and lead to the classification into doubtful and not very controllable merit classes.
In light of the arguments gathered and developed in this paper, we are doubtful whether the whole procedure (once completed thanks to the participation of tens of thousands of individuals, including evaluation experts, researchers, administrative staff, government agencies, etc.) will lead to the desired results, i.e., providing reliable information to rank universities and other research institutions, depending on the quality of their research. We understand the importance of national research assessment exercises for guiding strategic decisions; however, we believe that the VQR 2011-2014 has too many vulnerabilities that make it unsound and often controversial.
We believe that the major vulnerabilities of the VQR 2011-2014 can be (at least partly) solved by (1) extending the bibliometric evaluation procedure to the totality of the papers, (2) avoiding the use of journal metrics in general, and (3) avoiding questionable normalizations/combinations of the indicators in use. It might also be appropriate to introduce consolidated indicators that allow practical comparisons of papers from different areas, such as the so-called "success indicators" (Franceschini et al., 2013;Bornmann and Haunschild, 2016;Rousseau and Rousseau, 2016).
Finally, the introduction of the so-called altmetrics could be a way to solve (at least partly) the old problem of estimating the impact of relatively recent articles, without (mis)using journal metrics.

Tab. A2. Data concerning C i , J i and other indicators related to 64 fictitious papers of a specific SC and issue year. Y i is calculated using the relationship in Eq. 1, having set w = 40%. J i is the value of journal metric related to the publishing journal of the i-th paper; J i -rank is the corresponding rank position, having sorted the (64) papers of interest increasingly with respect to their J i values. F J (J i ) is the corresponding cumulative probability, considering the distribution of the (64) J i values available; C i is the number of citations accumulated by the i-th paper; C i -rank is the corresponding rank position, having sorted the (64) papers of interest increasingly with respect to their C i values. F C (C i ) is the corresponding cumulative probability, considering the distribution of the (64) C i values available; Y i is a composite indicator combining C i and J i , according to Eq. 1; Y i -rank is the corresponding rank position, having sorted the (64) papers of interest increasingly with respect to their Y i values. F Y (Y i ) is the corresponding cumulative probability, considering the distribution of the (64) Y i values available; The merit class of each i-th paper depends on the relevant F Y (Y i ) value, according to the conventions in Tab. 1.

Tab. A3. Data concerning C i ', J i , and other indicators related to 64 fictitious papers of a specific SC and issue year. The J i values are the same ones reported in Tab. A2, while the C i ' values replace the corresponding C i ones. Y i is calculated using the relationship in Eq. 1, having set w = 40%. J i is the value of journal metric related to the publishing journal of the i-th paper; J i -rank is the corresponding rank position, having sorted the (64) papers of interest increasingly with respect to their J i values. F J (J i ) is the corresponding cumulative probability, considering the distribution of the (64) J i values available; C i is the number of citations accumulated by the i-th paper; C i -rank is the corresponding rank position, having sorted the (64) papers of interest increasingly with respect to their C i values. F C (C i ) is the corresponding cumulative probability, considering the distribution of the (64) C i values available; Y i is a composite indicator combining C i and J i , according to Eq. 1; Y i -rank is the corresponding rank position, having sorted the (64) papers of interest increasingly with respect to their Y i values. F Y (Y i ) is the corresponding cumulative probability, considering the distribution of the (64) Y i values available; The merit class of each i-th paper depends on the relevant F Y (Y i ) value, according to the conventions in Tab. 1.

Tab. A4. Data concerning C i , J i , and other indicators related to 64 fictitious papers of a specific SC and issue year. C i and J i values are the same ones reported in Tab. A2, while Y i is calculated (using the relationship in Eq. 1), having set w = 60%. J i is the value of journal metric related to the publishing journal of the i-th paper; J i -rank is the corresponding rank position, having sorted the (64) papers of interest increasingly with respect to their J i values. F J (J i ) is the corresponding cumulative probability, considering the distribution of the (64) J i values available; C i is the number of citations accumulated by the i-th paper; C i -rank is the corresponding rank position, having sorted the (64) papers of interest increasingly with respect to their C i values. F C (C i ) is the corresponding cumulative probability, considering the distribution of the (64) C i values available; Y i is a composite indicator combining C i and J i , according to Eq. 1; Y i -rank is the corresponding rank position, having sorted the (64) papers of interest increasingly with respect to their Y i values. F Y (Y i ) is the corresponding cumulative probability, considering the distribution of the (64) Y i values available; The merit class of each i-th paper depends on the relevant F Y (Y i ) value, according to the conventions in Tab. 1.