When IR is lowered to machine IR and then to machine code, a lot of information is lost. Reconstructing that information is difficult, and not all of it can be reconstructed (although much of it may not be needed for post-link optimization). This is why BOLT’s disassembly-based approach also makes me nervous.
If rtld (or a similar component) wants to perform “LTO”, it has to leverage either high-level information like IR or BOLT-style disassembly information. In either case, performing some optimizations will be a very slow process, and I don’t see this as practical…
That said, some simple optimizations can be performed. For example, PLT entry generation can be moved to rtld, as in this x86-64 proposal to change some indirect jumps to direct jumps: https://groups.google.com/g/x86-64-abi/c/vbuHVMK_RIA
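To make the indirect-to-direct rewrite concrete, here is a minimal sketch of the check a loader would need before patching a PLT entry. It assumes the usual x86-64 PLT entry that jumps indirectly through a GOT slot is replaced by a 5-byte `jmp rel32` (opcode `0xE9`), which only works when the resolved target is within a signed 32-bit displacement of the end of the instruction. The function name `encode_direct_jmp` is hypothetical, and this is not the proposal's actual implementation.

```python
import struct

# Size of `jmp rel32`: 1 opcode byte (0xE9) + 4-byte signed displacement.
JMP_REL32_SIZE = 5


def encode_direct_jmp(insn_addr, target):
    """Encode `jmp rel32` placed at insn_addr that jumps to target.

    Returns the 5 instruction bytes, or None when the target is out of
    range and the entry would have to keep its indirect jump.
    (Hypothetical helper for illustration only.)
    """
    # The displacement is relative to the address of the next instruction.
    disp = target - (insn_addr + JMP_REL32_SIZE)
    if not -2**31 <= disp < 2**31:
        return None
    return b"\xe9" + struct.pack("<i", disp)


if __name__ == "__main__":
    # A nearby target can be reached directly.
    print(encode_direct_jmp(0x1000, 0x2000).hex())
    # A target more than 2 GiB away cannot.
    print(encode_direct_jmp(0x1000, 0x1000 + 2**32))
```

The range check is the interesting part: whether the rewrite applies depends on where the DSOs end up at run time, which is information only rtld has.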
Technically, LTO can be made to work with DSOs if we can guarantee that the link-time DSOs are the same as the run-time DSOs. However, this would be a lot of work, and the gain is unclear.
A runtime LTO scheme allows different link-time DSOs and run-time DSOs, but otherwise it is no better than LTO with DSOs. To allow different link-time DSOs and run-time DSOs, it looks like a lot of constraints are needed, making the scheme more niche.