Carlo Debernardi


Peer review as a measurement tool

The inner dynamics of academia have always obsessed me. That is why I was immediately curious when I saw this recent article by Arvan, M., Bright, L. K., & Heesen, R. (2025) on crowdsourced peer review. The authors put forward a version of the Condorcet Jury Theorem and essentially use it to argue in favour of an open online platform in which people could provide quantitative feedback on preprints. These numbers could in turn be used to produce an estimate of the quality of a research piece.

This system – trivializing the argument for the sake of brevity – would allow us to:

This post is just a way to collect the scattered and random thoughts this reading prompted.

Research quality?

First, we are talking about research quality, yeah, whatever that means. The authors argue in favour of an intersubjective notion of quality[1], and I very much agree it is probably our best shot at defining something like it. No point in chasing the Holy Grail when what we really need is some reasonable and actionable insight.

Doesn't this already exist?

Yes, there are already platforms out there allowing public discussion of articles and preprints; think of alphaXiv or PubPeer, but I'm sure there are many others. A completely different matter is that people are not in the habit of interacting via these tools, partially because of a lack of institutional incentives towards anything different from churning out papers. But then again, institutional incentives are not aligned with a reasonable way to do research in general. And since these are the real problems, I'm not going to discuss them here. I have no general solution in mind, and we are probably doomed anyway.

The main difference is that these platforms provide a venue to discuss research but do not explicitly attempt a quantification of its quality. I'm not 100% sure this is actually feasible in a sensible way, but then again, we are already doing it in a not-so-sensible way with the current system, and I think the attempt to do better is worth the effort.

Implementation details

Of course, with such a platform there is potential for abuse in the form of review bombing, trolling and the like. The authors acknowledge this, there are ways to mitigate the issue, and I think it is not the most exciting part of the debate. What I find more stimulating is trying to imagine the implementation details of such a platform, and the challenges it would pose.

For example, a prominent feature of the current peer review system is anonymity in its various forms. Naturally, in most cases this is merely formal, since many research communities are relatively small, and the guesswork required to identify the authors or reviewers of a manuscript is not particularly hard.

On such a platform, hiding the identity of the authors would simply be pointless; the system explicitly orbits around preprints. This opens the door to unfair advantages for prominent scholars, who would benefit from their reputation alone, but I'm frankly more concerned about the possibility of retaliation against (especially early-career) scholars expressing criticism. Hiding the identity of reviewers, though, would definitely be possible. Imagine just a switch or a checkbox setting.

Still, you would not be 100% anonymous; the system would link your profile to your review (otherwise people could, for example, leave an unlimited number of reviews), but your name or identifier would not show up for other users to see. And, if you think about it, this is exactly how peer review currently works. The identity of reviewers and authors is available in whatever editorial management system the journal uses, and such information is in principle accessible to the editor and other people responsible for the handling of the journal.
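Just to make the idea concrete, here is a minimal sketch of what such a review record could look like (all names and fields are hypothetical, of course): the reviewer is always linked internally, and a simple flag controls whether their identity is displayed publicly.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Review:
    """A hypothetical review record: the reviewer is always linked
    internally (so each account can review a preprint only once),
    but their identity is shown publicly only if they opt in."""
    preprint_id: str          # the preprint being reviewed
    reviewer_id: str          # always stored, never optional
    score: int                # ordinal judgement, e.g. 1-5 Likert
    comment: str              # free-text feedback
    show_identity: bool       # the "anonymity switch"
    created_at: datetime

    def public_view(self) -> dict:
        """What other users get to see."""
        return {
            "preprint_id": self.preprint_id,
            "reviewer": self.reviewer_id if self.show_identity else "anonymous",
            "score": self.score,
            "comment": self.comment,
            "created_at": self.created_at.isoformat(),
        }
```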

Another potentially interesting twist of this proposal is the possibility of interaction with the system currently in place. It would be pretty easy to implement a feature allowing journal editors and the like to invite reviewers, who would then provide public feedback (possibly anonymously, but still disclosing their identity to the person inviting them). For the many journals that already publish their review reports in some form, this would not be dramatically different, but the benefit would be having the feedback systematically collected in a centralized public repository.
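Continuing the toy data model sketched above, an invitation could just be another hypothetical record linking an editor, a reviewer, and a preprint, with the reviewer's identity visible to the inviting editor even when the resulting public review is anonymous.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ReviewInvitation:
    """An editor invites a reviewer to publicly review a preprint.
    The inviting editor always knows who they invited, even if the
    resulting review is displayed anonymously to everyone else."""
    preprint_id: str
    editor_id: str            # who sent the invitation (e.g. a journal editor)
    reviewer_id: str          # visible to the editor, not necessarily to the public
    journal: str | None       # optional link to the journal's own process
    invited_at: datetime
    accepted: bool = False
```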

A model...?

I'm also a sucker for statistics, especially the Bayesian-flavoured kind[2]. So I started to think of this quantification problem in terms of measurement.

First of all, I do not think we can ask reviewers to provide an explicitly quantitative judgement. People are notoriously bad at that. What we can get is, at best, some kind of ordinal information (basically a Likert scale).

The authors also argue in favour of a one-dimensional measure of quality. I have mixed feelings in this respect. On the one hand, research clearly has many dimensions. A paper could have great theorizing and crappy methods, or vice versa. Collecting information about multiple dimensions does not stop us from computing and/or reporting a single measure if we really want to provide simplified information to the public. The biggest limitation I see is that the relevant dimensions are neither unique nor uniformly applicable to all research efforts. A "data collection" score is not applicable to a theoretical contribution. But even a rigid definition of categories that could be optionally applied when relevant smells like a further step towards research bureaucratization and standardization, which is not ideal.

Anyway, imagine we have multiple quantitative judgements of a piece of work provided by different people. How could we sensibly try to aggregate them? A simple mean would not be terribly satisfactory. I mean, multiple people giving their opinions do not really work anything like multiple scales weighing the same bowl of flour, do they?

There is a lot of structure in this kind of measurement; it is not random by any means, and it is not going to go away simply by having more data! People might have different standards and tend to systematically give higher or lower scores. They could be biased (positively or negatively) towards topics, methods, or theoretical orientations, or even have beef with the authors of a certain work[3].

I feel like a reasonable starting point to model this kind of data would be something like:

S_ij ~ P_i + R_j + B_ij + ...?

Reading along the lines of: the score S_ij given by reviewer j to paper i is the sum of (roughly) four components: the quality of the paper P_i, the overall severity of the reviewer R_j, a reviewer-paper bias B_ij, and whatever else ends up in the dots (noise, at the very least).
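Just to fix ideas, here is a minimal sketch of how this could look as a partially pooled Bayesian model in PyMC, with two simplifying assumptions of mine: the Likert scores are treated as continuous (an ordinal likelihood would be more appropriate), and the bias term B_ij is left out entirely. Data, priors, and names are placeholders.

```python
import numpy as np
import pymc as pm

# Toy data: one row per review, with integer indices for papers and reviewers.
rng = np.random.default_rng(42)
n_papers, n_reviewers, n_reviews = 30, 20, 120
paper_idx = rng.integers(0, n_papers, n_reviews)
reviewer_idx = rng.integers(0, n_reviewers, n_reviews)
scores = rng.integers(1, 6, n_reviews).astype(float)  # 1-5 Likert, here just noise

with pm.Model() as review_model:
    # Latent paper quality P_i and reviewer severity R_j, both partially pooled.
    sigma_p = pm.HalfNormal("sigma_p", sigma=1.0)
    sigma_r = pm.HalfNormal("sigma_r", sigma=1.0)
    P = pm.Normal("P", mu=0.0, sigma=sigma_p, shape=n_papers)
    R = pm.Normal("R", mu=0.0, sigma=sigma_r, shape=n_reviewers)

    # Grand mean and residual noise (part of the "...?" in the formula above).
    mu0 = pm.Normal("mu0", mu=3.0, sigma=1.0)
    sigma = pm.HalfNormal("sigma", sigma=1.0)

    # S_ij ~ mu0 + P_i + R_j + noise; the bias term B_ij is omitted here.
    pm.Normal("S", mu=mu0 + P[paper_idx] + R[reviewer_idx],
              sigma=sigma, observed=scores)

    trace = pm.sample(1000, tune=1000, chains=2, random_seed=42)

# The posterior for P is then a (shrunken) estimate of each paper's quality.
print(trace.posterior["P"].mean(dim=("chain", "draw")))
```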

Now, I have no idea how to reliably measure the "bias" of a reviewer regarding a specific paper. Or rather, I have many ideas, but none of them is even somewhat convincing.

I already tried, for a little project, to use the distance between a manuscript's abstract and the abstracts of the reviewers' publications, computed with a text-embedding model. The thing is, I'm not sure this would be reliable: it makes tons of assumptions, it requires a ton of (notoriously not great in quality) bibliometric data, and the results might heavily depend on the specific model you use.
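For what it's worth, the embedding idea is straightforward to sketch, for instance with sentence-transformers; the model name below is just a common default, not the one I actually used, and as said, the output is quite sensitive to that choice.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Any sentence-embedding model would do; this one is just a common default.
model = SentenceTransformer("all-MiniLM-L6-v2")

manuscript_abstract = "We study how peer review scores relate to ..."
reviewer_abstracts = [
    "A Bayesian hierarchical model for ...",
    "Qualitative interviews with ...",
]

m_vec = model.encode([manuscript_abstract])
r_vecs = model.encode(reviewer_abstracts)

# Cosine similarity between the manuscript and each of the reviewer's papers;
# one minus the mean similarity is one (crude) proxy for topical "distance".
sims = cosine_similarity(m_vec, r_vecs)[0]
distance = 1.0 - sims.mean()
print(f"topical distance: {distance:.3f}")
```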

Another alternative would be to use the agreement with the other reviewers on other manuscripts. This would allow us to give relatively less weight to redundant opinions. The end result would be something like "we expect higher quality when people who usually disagree give uniformly high scores". This sounds similar to the algorithm behind Twitter/X community notes[4], but without (?) the one-dimensional assumption on the opinion space. The obvious limit, however, would be the extreme sparsity of the matrix, since not everyone rates everything (not even close).
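A very crude version of this could look like the sketch below: compute how much each reviewer disagrees, on average, with the others on commonly rated papers, and give more weight to the less redundant ones. This is nowhere near the actual community-notes algorithm (which uses matrix factorization), just the general flavour, and the weighting rule is an arbitrary choice of mine.

```python
import numpy as np

# scores[i, j] = score given by reviewer j to paper i, NaN if they did not rate it.
scores = np.array([
    [4.0, 5.0, np.nan, 2.0],
    [3.0, np.nan, 3.0, 1.0],
    [np.nan, 4.0, 4.0, 2.0],
    [5.0, 5.0, np.nan, 3.0],
])
n_papers, n_reviewers = scores.shape

# Mean absolute disagreement between every pair of reviewers,
# computed only on the papers both of them rated.
disagreement = np.full((n_reviewers, n_reviewers), np.nan)
for j in range(n_reviewers):
    for k in range(n_reviewers):
        if j == k:
            continue
        both = ~np.isnan(scores[:, j]) & ~np.isnan(scores[:, k])
        if both.any():
            disagreement[j, k] = np.abs(scores[both, j] - scores[both, k]).mean()

# A reviewer who usually disagrees with everyone else carries more independent
# information, so they get a larger weight; redundant ones get less.
weights = 1.0 + np.nanmean(disagreement, axis=1)
weights /= weights.sum()

# Weighted mean score per paper, ignoring missing ratings.
rated = ~np.isnan(scores)
paper_scores = (np.where(rated, scores, 0.0) * weights).sum(axis=1) / (rated * weights).sum(axis=1)
print(paper_scores.round(2))
```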

Yet another option could be something along the lines of an IRT model, where papers are the analogue of students taking an exam and reviewers are the "questions": some are more difficult, some are better at discriminating between good and bad outcomes.
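In that spirit, the hierarchical sketch above could be reshaped into something 2PL-flavoured, with a reviewer-specific "discrimination" multiplying the latent paper quality. Again, this is my own rough translation of the analogy, with a continuous likelihood just to keep the sketch short; a proper graded-response model would use an ordinal one.

```python
import numpy as np
import pymc as pm

# Same toy indexing as before: one row per review.
rng = np.random.default_rng(0)
n_papers, n_reviewers, n_reviews = 30, 20, 120
paper_idx = rng.integers(0, n_papers, n_reviews)
reviewer_idx = rng.integers(0, n_reviewers, n_reviews)
scores = rng.integers(1, 6, n_reviews).astype(float)

with pm.Model() as irt_like_model:
    # "Ability": latent paper quality.
    quality = pm.Normal("quality", mu=0.0, sigma=1.0, shape=n_papers)
    # "Difficulty": how harsh each reviewer is on average.
    harshness = pm.Normal("harshness", mu=0.0, sigma=1.0, shape=n_reviewers)
    # "Discrimination": how strongly a reviewer's score tracks quality.
    discrimination = pm.HalfNormal("discrimination", sigma=1.0, shape=n_reviewers)

    eta = discrimination[reviewer_idx] * (quality[paper_idx] - harshness[reviewer_idx])

    sigma = pm.HalfNormal("sigma", sigma=1.0)
    pm.Normal("S", mu=3.0 + eta, sigma=sigma, observed=scores)

    trace = pm.sample(1000, tune=1000, chains=2, random_seed=0)
```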

Except that... now that we are reasoning in terms of a statistical model, maybe we need to take a step back and get a bit more of a theoretical understanding of the phenomenon and of the property we are trying to measure? After all, Statistical Control Requires Causal Justification! Here, instead, we are merely saying that the "quality" of a manuscript is whatever is not explained away by the marginal contributions of a bunch of reviewer traits... But, just to make an example, the topic of a paper could easily causally affect its quality as well as the reviewer's bias, thus acting as a confounder. And we are not even considering the self-selection of reviewers into giving feedback on the papers they are interested in! Again, I have more doubts than answers, but I think there is potential in trying to tackle difficult issues.

Anyway, some more thoughts

This piece is closer to a stream of thought than a fully formed reflection on the topic. Thinking about this, however, brought me to reconsider how I think of peer review in its current form.

If we frame it as a measurement issue, we are essentially engaging in a sort of collective suspension of disbelief. None of the empirical researchers I know would ever be confident drawing conclusions from two or three data points. You can be a p-value lover, a confidence interval enjoyer, or have reached inner peace with posterior distributions. But a couple of observations are probably not enough to convincingly conclude anything about a complex question like "is this paper worth publishing/reading/citing/whatever".
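To make that concrete with a quick, made-up illustration: with three scores on a 1-to-5 scale, the uncertainty around the mean is wider than the scale itself.

```python
import numpy as np
from scipy import stats

scores = np.array([3.0, 4.0, 2.0])  # three reviewers, 1-5 scale
mean = scores.mean()
sem = stats.sem(scores)

# 95% t-interval around the mean score.
lo, hi = stats.t.interval(0.95, df=len(scores) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
# mean = 3.00, 95% CI = (0.52, 5.48) -- wider than the whole scale.
```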

This is to say, we hold our own work to a much higher standard of evidence than we hold its assessment. We might have no better alternative, and there might be other good reasons for this, but I think it is at least worth noting.

For the record: I vaguely remember a Tweet (or was it on Bluesky?) by @rlmcelreath discussing a study on peer review and touching on the same point, probably better than I'm doing here: with just a couple of observations you have no statistical power to reliably draw conclusions. The search function is not helping me, but if I find it I will link to it later.