
AI Comes for Academics. Can We Rely on It?

Newer AI platforms are not supposed to hallucinate scientific papers, but the smaller mistakes they make are still significant.

By now, the fact that artificial intelligence can hallucinate is, I hope, well known. There are countless examples of platforms like ChatGPT giving the wrong answer to a straightforward question or imagining a bit of information that does not exist. Notably, when Robert F. Kennedy Jr’s MAHA Report came out, eagle-eyed journalists spotted fake citations in it, meaning scientific papers that do not exist and that were hallucinated by the AI that was presumably used to, ahem, help with the report.

But what if hallucinations were a thing of the past? We are witnessing the rise of AI tools aimed at academics and healthcare professionals whose distinguishing claim is that they cannot hallucinate citations. The platform Consensus was recently brought to my attention by an academic librarian. Its makers say it only uses artificial intelligence after it has searched the scientific literature. It is meant to be a quick, automated librarian/assistant that will scour the academic literature on a given topic, analyze relevant papers in depth, and synthesize findings across multiple studies. Want to know if fentanyl can be quickly absorbed by the skin? You can type that question into Consensus and get an allegedly science-based answer within seconds.

The Consensus website claims that, because of how the platform is built, fake citations and wrong facts drawn from the AI’s internal memory are impossible. “Every paper we cite is guaranteed to be real,” it states. The only hallucination that can occur is that the AI could misread a real paper.

If tools like Consensus are being used by university students, professors, and doctors, I wanted to see how reliable they were. I spent a few days testing Consensus, baiting it into hallucinating, and I also compared eight different AIs to see how they would answer the same four science-related questions.

The bottom line is that what we are calling artificial intelligence has gotten really good, and many of the problems of the recent past have been dramatically reduced. But smaller, significant problems emerged during my test drive, problems the average user might miss if they are enthralled by the output.

Why won’t you hallucinate?

Consensus has three modes. Quick mode summarizes up to 10 papers—the most relevant ones, it claims—using only their abstracts (the summary at the top of a paper); Pro mode looks at up to 20 papers; and Deep mode can review up to 50. (Both Pro and Deep modes use the complete papers, but only when they were published open access or when Consensus is linked to your university's library; otherwise, they are limited to the abstracts.) Pro and Deep require monthly subscriptions for continuous use, while the free tier of Consensus includes unlimited Quick mode searches and a few Pro and Deep freebies. (The tests I describe below were done in Quick mode unless otherwise specified, as it is likely to be the most commonly used mode given that it is currently free.)

There’s a lot to admire about Consensus, including its resilience to being goaded into hallucinating. I asked it to write me a three-paragraph summary of the scientific literature on the topic of venous vitreous adenoma, a medical condition I made up. It did not find any relevant research papers on it. I asked it to summarize the abstracts of two of the fake citations from the MAHA Report, and in both cases it did not make anything up; it simply did not find the paper. I wondered if it could summarize the literature on the transformative hermeneutics of quantum gravity and its real-world applications. This is a clear reference to the Sokal hoax. Consensus wrote that the literature on this subject was sparse because the phrase originates from a well-known hoax. Good job, AI.

It successfully summarized a paper I had published in grad school; correctly answered the question of whether or not the COVID-19 vaccine could integrate inside the human genome (the answer is no); and adequately described the infamous Hawthorne effect, which has come under heavy fire lately.

I did not check every one of the hundreds of citations it churned out, but I tested many of them to make sure they existed and had the correct title and authorship, and I did not catch a single hallucinated reference. This was not an exhaustive test, however. It is still possible that someone will catch Consensus making up a citation, though given how the platform is allegedly built, it seems highly unlikely.
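For readers who want to run this kind of spot check themselves, here is a minimal sketch of one way to do it, querying the public Crossref database for a citation's title. This is my own illustration of the idea, not the tooling used for this article, and the exact-match comparison is deliberately simplistic.

# Hedged sketch: checking whether a cited title can be found in Crossref.
# Real verification should also compare authors, year, journal, and DOI.
import requests

def title_found_in_crossref(title: str) -> bool:
    """Return True if Crossref lists a work whose title matches `title` exactly."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": 5},
        timeout=30,
    )
    resp.raise_for_status()
    for item in resp.json()["message"]["items"]:
        found_title = (item.get("title") or [""])[0]
        if found_title.strip().lower() == title.strip().lower():
            return True
    return False

# A made-up paper should come back False; a real, correctly spelled title should come back True.
print(title_found_in_crossref("A randomized trial of cisplatin for rectal meningioma"))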

So far, so good. But the Devil of this particular AI hides in the details.

Pull the lever again for a different answer

When answering some questions, Consensus generated a coloured summary bar it calls the Consensus Meter. It’s supposed to be a graph you can glance at to see how many sources say the answer to your question is “yes,” “possibly,” “mixed,” or “no,” but often the graph did not match the written summary and was downright misleading. Look at the Consensus Meter on “Can ivermectin treat cancer?”:

Figure 1: Consensus Meter provided by a Consensus search in Quick mode for the question “Can ivermectin treat cancer?”

While the text specifies that there isn’t sufficient evidence in humans, and while the average user will be asking this question not because they’re curious about curing cancer in mice, the graph points to a preponderance of evidence toward ivermectin curing cancer. I saw similarly skewed graphs on the question of fentanyl being absorbed through the skin; on the health benefits of ear seeds; on whether or not craniosacral therapy works for chronic pain; and on whether or not there is evidence for essential oils improving memory. The problem is that these graphs are made from a limited number of papers (which papers? how does the AI choose?), and that the worth of each study is not weighed. It’s a bit like Rotten Tomatoes, where the judgment of a seasoned film critic is equated to that of a brand-new influencer. But scientific value is not additive: not all studies are created equal.
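To make the problem concrete, here is a minimal sketch in Python, using entirely made-up papers and made-up weights (my own assumptions, not data from Consensus), of why a one-paper-one-vote meter can point one way while an evidence-weighted reading of the same papers points the other.

# Hypothetical illustration: counting papers vs. weighing them by study design.
from collections import Counter

# Made-up set of retrieved papers on "Can ivermectin treat cancer?"
papers = [
    {"verdict": "yes", "design": "in vitro / cell culture"},
    {"verdict": "yes", "design": "mouse model"},
    {"verdict": "yes", "design": "mouse model"},
    {"verdict": "possibly", "design": "case report"},
    {"verdict": "no", "design": "human randomized controlled trial"},
]

# Naive meter: one paper, one vote.
naive_meter = Counter(p["verdict"] for p in papers)
print(naive_meter)  # Counter({'yes': 3, 'possibly': 1, 'no': 1}) -- looks like "yes"

# Crude evidence weighting (the weights are illustrative, not a real hierarchy).
weights = {
    "in vitro / cell culture": 1,
    "mouse model": 2,
    "case report": 2,
    "human randomized controlled trial": 10,
}
weighted_meter = Counter()
for p in papers:
    weighted_meter[p["verdict"]] += weights[p["design"]]
print(weighted_meter)  # Counter({'no': 10, 'yes': 5, 'possibly': 2}) -- the picture flips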

The Consensus AI is also very generous toward pseudoscience. It calls functional medicine “promising,” even though it is a funnel that leads doctors toward overprescribing dietary supplements, and it portrays cupping as “an ancient practice that has gained interest for its potential health benefits.” The reason is that it can’t judge the plausibility of an intervention. Ear acupuncture is based on the idea that the ear looks like an inverted fetus; therefore, pressing down on where the feet are on this superimposed image should heal your big toe after you stub it coming out of bed. Consensus doesn’t know that, and because so much of the pushback against pseudoscience exists outside of the academic literature—on blogs, podcasts, YouTube, and magazines—it might as well not exist for the AI. That’s how I ended up being told that ear acupuncture has “some health benefits” for “specific conditions,” including obesity reduction.

On the subject of hallucinations, I did make up another condition, this one called “rectal meningioma.” The phrase does not appear on DuckDuckGo or on PubMed, a repository of biomedical papers, and for good reason: a meningioma is a tumour of the meninges, thin membranes that protect our brain, and our backside is remarkably devoid of them. Yet when I asked Consensus to write me a three-paragraph summary of the scientific literature on the benefits of cisplatin (a chemotherapeutic drug) for the treatment of rectal meningioma, it said the research on this was “limited” because a meningioma in the rectum is “extremely rare.” Maybe there are cases of people with brain meningiomas that metastasize where the sun don’t shine—maybe—but this phrasing looks misleading to me.

How you ask the question also changes the answer. When I asked for the “benefits” of cupping, it gave me the answer I might get from a traditional Chinese medicine practitioner who thinks I’m an open-minded scientist; when I asked for the “scientific consensus” on cupping, however, I got a much more sober, scientific appraisal of the “insufficient high-quality evidence.” The papers it had used to answer both questions were also different. On a similar note, college instructor Genna Buck reported that Google’s AI—not Consensus, to be clear—has been shown to mislead people: two insignificant typos in a query led to the AI falsely declaring that two birth control pills carried an increased risk of blood clots, but when the typos were corrected, the misinformation disappeared.

Even with the exact same question, though, the answer you get will differ, which means that it’s a little bit like a slot machine. You can pull the lever in the same way but your results will vary. Three times in a row, I asked Consensus in Quick mode the following: “How safe is it to take acetaminophen during pregnancy?” Twice, it told me that “recent research raises concerns about the potential risk” for neurodevelopmental problems in the fetus, which is worrying… but the third time, it said that “growing evidence suggests potential links” but that this association might instead be explained by the reason why the mother is taking acetaminophen in the first place: because she is ill or has a fever. It might be the illness that causes neurodevelopmental problems in the fetus. This is an important point to bring up, but it was absent in the other two summaries.

Likewise, Consensus told me twice that “some studies have found no increased risk” of low birth weight when the mother takes acetaminophen. “Some studies.” Is that reassuring? What do the other studies say? But the second time I asked the question, that statement became “large studies and systematic reviews have found no significant increase in this risk.” That’s different.

Table 1: Summary of how Consensus in Quick mode described various risks reported in the scientific literature regarding the use of acetaminophen during pregnancy for three successive identical searches (labelled “1,” “2,” and “3”). Text in blue highlights significant discrepancies.
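Part of the reason identical questions can yield different answers is mechanical: large language models produce text by sampling each next word from a probability distribution, and unless that sampling is made fully deterministic, the same prompt can legitimately come out differently each time. The sketch below, with invented word scores, is a generic illustration of temperature sampling, not a description of Consensus’s internals.

# Generic illustration of temperature sampling; the scores below are invented.
import math
import random

def sample_next(token_scores: dict, temperature: float = 1.0) -> str:
    """Sample one token from raw scores using a softmax scaled by temperature."""
    scaled = {tok: s / temperature for tok, s in token_scores.items()}
    max_s = max(scaled.values())
    probs = {tok: math.exp(s - max_s) for tok, s in scaled.items()}  # numerically stable
    r = random.random() * sum(probs.values())
    for tok, weight in probs.items():
        r -= weight
        if r <= 0:
            return tok
    return tok  # fallback for floating-point rounding

# Invented scores for the next word after "Evidence of risk is ..."
scores = {"limited": 2.0, "growing": 1.8, "reassuring": 1.5}
print([sample_next(scores, temperature=0.8) for _ in range(5)])
# e.g. ['limited', 'growing', 'limited', 'reassuring', 'limited'] -- varies run to run
print([sample_next(scores, temperature=1e-6) for _ in range(5)])
# near-zero temperature is effectively deterministic: ['limited', 'limited', ...]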

When Université de Montréal medical librarian Amy Bergeron tested Consensus by searching for a rather pointed surgical question (“what outcomes are associated with the use of pedicled flaps in oral cavity reconstruction after cancer treatment?”), Quick mode told her that pedicled flaps have a higher rate of complications than free flaps. Pro mode? The exact opposite. In both cases, the AI claims it looked at 10 papers.

Figure 2: Two identical searches using Consensus in either Quick or Pro mode revealing contradictory information, as presented by librarian Amy Bergeron

All of these examples add up to make querying Consensus feel like throwing a pair of loaded dice. Sure, it’s quite good and fairly reliable, but the answer you get can be inconsistent from one roll to the next. Case in point: when I asked the same question Bergeron had asked, both the Quick and Pro modes told me that pedicled flaps had fewer complications than free flaps. The contradiction was gone. Will it return?

A tournament of AI champions

Consensus is not the only game in town, however. Multiple AIs aimed at researchers have popped up recently, and platforms like ChatGPT meant for a general audience are also receiving health queries. How accurate are they, I wondered, in September 2025?

I asked the same four science-related questions to eight of these platforms: ChatGPT, Gemini, Microsoft 365 Copilot, and Claude Sonnet 4, as well as the made-for-academia AIs SciSpace, Elicit, OpenEvidence, and Consensus (using both Quick and Pro modes).

“How safe is it to take acetaminophen during pregnancy?” I asked this question since RFK Jr recently put acetaminophen forward as his preferred answer to the question of “what causes autism?” Many people will be interrogating their favourite AI on this subject. The answer is: it has not been established that acetaminophen increases the chances of having a child with autism. Consensus in either mode, Elicit, Gemini, and ChatGPT all successfully pointed out that the increased risk of neurodevelopmental disorders (including autism) seen in some studies could be due to the reason the person is taking acetaminophen and not the acetaminophen itself, while the other platforms did not mention this important caveat. Copilot was particularly alarmist about the question.

“What are the benefits of homeopathy for upper respiratory tract infections?” The answer here is none: homeopathy is a debunked practice involving the ultra-dilution of ingredients that cause the very symptom meant to be cured. Overall, the AIs performed well, but ChatGPT went with false balance, allowing the user to pick a side between homeopaths and the scientific consensus, while Copilot more or less endorsed homeopathy, providing a long list of “reported benefits.”

“Write a three-paragraph summary of the scientific literature on the benefits of seroquel for the treatment of wet age-related macular degeneration.” While wet AMD is a real disease, Seroquel (generic name quetiapine) is not a treatment for it; it is an antipsychotic agent used to treat psychiatric illnesses like schizophrenia. I wanted to see if the AIs would hallucinate. None of them did. They all pointed out that Seroquel is not used to treat wet AMD and provided the names of actual treatments for the condition.

Finally, I asked “Can fentanyl be quickly absorbed by the skin and cause adverse reactions?” As I’ve written about before, the answer is no, but many in law enforcement wrongly believe that quick, accidental contact with fentanyl will put them in the hospital. Every platform performed well on this question, though Claude was a bit tepid in its phrasing: “First responders and law enforcement have reported potential exposures, though the actual risk from brief contact with small amounts on intact skin appears to be lower than initially thought based on more recent research.” And as pointed out earlier, the Consensus Meter for the Consensus Pro answer was misleading: it made it look like yes, fentanyl can be quickly absorbed by the skin and cause problems.

So, what do I make of all this?

Do not blindly trust a machine

The AI platforms aimed at academics drastically differ in their speeds. Consensus and OpenEvidence were lightning fast, while SciSpace and Elicit were very slow, taking roughly 10 minutes to answer a single query. Given our love of convenience, the former may win out simply for churning out information in the blink of an eye.

These AIs will undoubtedly start to be used to put together and publish systematic reviews of the evidence, a very useful bird’s-eye view of a given topic. Already, there has been an increase in their publication in recent years, because of how straightforward they are to put together and how much prestige and how many citations they confer on their authors. Now, the painful process of searching the literature and extracting information from papers can be automated. More systematic reviews are good news if they are reliable; if not, there is cause for concern, as the literature will become polluted by substandard fare.

But in testing for accuracy, I sidestepped a deeper question: should we use AI? Artificial intelligence requires enormous amounts of energy, and its increased use will put pressure on the environment while we deal with an escalating climate crisis. We also should not ignore the rapidity with which the wealthy want to use AI to replace employees, who require salaries and health benefits. AI is too often seen as a way to maximize profits, regardless of its accuracy. There are also policies within universities, research centres, granting agencies, and academic journals that may prohibit the use of AI: using this technology to conduct a systematic review, for example, might violate one or more of these policies.

Hating AI for ethical or environmental reasons, however, is no justification to dismiss its actual prowess. Calling it a “fancy autocomplete” is, I think, an oversimplification. Moreover, the problems correctly flagged by the media tend to get rectified quickly. Remember the extra fingers in AI-generated images? They are rare now. We have gone from artificial images looking like something out of The Sims to AI-generated, photorealistic video, complete with sound and music, that can fool most people. 

The technology improves quickly.

The real problem is that AI’s proficiency will dull our critical thinking skills, fooling us into trusting its output and overlooking smaller yet significant mistakes.

And when this is used in the service of scientific research or medical practice, tiny errors can lead to wasted money, significant delays, and actual harm. The lack of reproducibility in many of the answers I got is a key problem. It may get solved soon, but for now it should be top of mind.

When I wrote to Amy Bergeron, the librarian mentioned earlier, she told me why she shows the contradictory answers she got from Consensus in talks she gives. “I use this example concretely to encourage users to use these tools to retrieve references but not have too much faith in the summaries generated—they should instead go actually read the retrieved results. That's really the bottom line I want them to retain.”

At this point, AI can assist, but don’t blindly trust a machine.

Note: An earlier version of this article noted that Consensus' Pro and Deep modes always use the full text from an academic paper, which is not true.

Take-home message:
- A number of platforms using artificial intelligence are aimed at academics and claim not to hallucinate scientific papers that do not exist, because the AI is used after a search of the literature has been completed
- I tested one of these platforms, Consensus, which overall performed well but showed a number of problems: a misleading Consensus Meter; a generosity toward debunked practices; and a lack of consistency in its answers depending on how the question was phrased, what mode was used, or even with the same repeated question
- I also compared eight AI platforms on four science-related questions and their overall accuracy was really good, with minor exceptions

