If that’s happening, you are still using one of the shittiest non-reasoning LLMs. They’re basically trying to recite the DOI from memory. You want to use something that actually goes and checks the sources itself.
One issue is that a lot of websites have now blocked LLMs from accessing them. So for scientific papers, ChatGPT often gets routed through ResearchGate, university repositories etc., because it can’t pull papers directly from some publishers. It will sometimes bluff and tell you about a paper when it has only read the citation (title, authors, year etc.) and not the content.
That would be a dream, and I think it will be possible in the future. It takes a huge amount of compute to process that much data at once, and the amount we get access to as customers is tiny. Every model you can use is throttled and not running at full capacity.
For example, the free version of ChatGPT reportedly gives you a model with around 30-40B parameters. GPT-4o is rumoured to be around 175B. ChatGPT Pro ($200/month) supposedly gets you something in the 1.5T range, and internally they apparently have models with up to 10T. So the people asking the free version questions are working with maybe 0.3-0.4% of the potential capability.
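Quick sanity check on that ratio, using the rumoured parameter counts above (none of which are confirmed):

```python
# Rough capability ratio using the rumoured parameter counts above (unverified)
free_tier = 40e9      # free ChatGPT, ~30-40B parameters (taking the high end)
internal_max = 10e12  # the ~10T internal model figure
print(f"{free_tier / internal_max:.2%}")  # 0.40%, hence "around 0.3-0.4%"
```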
There is also a limit to the working memory of the models we’re given, so at some point the model starts summarising the content of the conversation. GPT-5 can hold around 128-256k tokens (less than 200 pages of text) in its context. So if you ask it to analyse 20 documents, it internally summarises them, then works with summaries of summaries etc - the analysis gets shallower the more it has to hold. At the moment that’s baked into the GPT architecture itself, and no amount of compute will fix it. So you’re right that it needs a new architecture of some sort.
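For anyone curious, here’s a minimal sketch of what that summaries-of-summaries fallback looks like. Everything in it is an assumption for illustration: the 128k window, the rough token estimate, and summarize() standing in for a real model call (here it just truncates to a budget).

```python
# Minimal sketch of "summaries of summaries": once the documents no longer fit
# in the context window, each pass trades detail for space.

CONTEXT_WINDOW = 128_000  # assumed token limit

def count_tokens(text: str) -> int:
    # crude estimate: roughly 0.75 words per token
    return int(len(text.split()) / 0.75)

def summarize(text: str, max_tokens: int) -> str:
    # placeholder for a real summarisation call; here it just truncates
    words = text.split()
    return " ".join(words[: int(max_tokens * 0.75)])

def analyse(documents: list[str]) -> str:
    combined = "\n\n".join(documents)
    passes = 0
    while count_tokens(combined) > CONTEXT_WINDOW:
        # give each document an equal share of the window and compress it
        budget = CONTEXT_WINDOW // len(documents)
        documents = [summarize(doc, budget) for doc in documents]
        combined = "\n\n".join(documents)
        passes += 1
    # `passes` rounds of compression happened before the model ever saw
    # everything at once, which is where the shallowness comes from
    return combined
```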
Out of curiosity, if you search “Rapamycin” on PubMed, you get around 60,000 results. That’s a lot, but not an obscene amount of data, especially once you get rid of duplicates, reviews, irrelevant studies etc. If you picked the top 10,000 papers, that would be roughly 100M tokens (at ~10,000 tokens per paper), so the current context window is around 400x too small. Wonder how long it will take to scale that up to the size needed for this sort of thing.
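Rough maths behind the 400x figure - the ~10,000 tokens per paper is my own assumption for a full-length paper:

```python
# Back-of-envelope on the corpus size (tokens-per-paper is an assumption)
papers = 10_000
tokens_per_paper = 10_000                  # roughly one full-length paper
corpus_tokens = papers * tokens_per_paper  # 100,000,000 tokens
context_window = 256_000                   # upper end of the GPT-5 figure above
print(corpus_tokens / context_window)      # ~390, i.e. roughly 400x too small
```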