@Thorham
This won't work. A MOD is smaller because it uses samples, usually at low quality (short, low sample rate and bit resolution, mono, etc.). In a regular music production (even one done with samples), the samples could easily exceed the memory needed to store the final mixdown. As soon as you add analog sources like voice recordings or real instruments, the amount of information needed to describe the music immediately goes up.
Apart from that, mixing the channels is a lossy process and cannot be reverted. What we actually hear is extremely lossy, even worse than an mp3.
What the human ear does is track instruments and noises by harmonic frequencies or other frequency distributions it has learned before. In doing so, it misses a lot of what is actually in the music. You might hear it the next time you listen, but then you'll miss something else. The ear relies on a lot of information acquired before ever hearing the piece.
E.g., taken to the extreme, you could write a codec that stores every piece of music that will ever be encoded/decoded with it in a database. Encoding and decoding then come down to an index lookup; every piece of music is compressed to a single integer.
But in the end, you have to store the information somewhere: in the encoded data, or outside it as meta knowledge.
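To make the thought experiment concrete, here's a toy sketch (names and structure are my own invention, not any real codec): the "compressed" form really is just an integer, but only because the information has moved into a database both sides must already share.

```python
# Toy "database codec": every track ever encoded lives in a shared table,
# so the encoded form is just its index. The information hasn't vanished,
# it has moved into the database (meta knowledge outside the bitstream).

DATABASE = []  # prior knowledge shared between encoder and decoder

def encode(track: bytes) -> int:
    """Return the track's index, adding it to the database if unseen."""
    try:
        return DATABASE.index(track)
    except ValueError:
        DATABASE.append(track)
        return len(DATABASE) - 1

def decode(index: int) -> bytes:
    return DATABASE[index]

song = b"\x00\x7f\x00\x81" * 1000  # stand-in for raw PCM samples
idx = encode(song)
assert decode(idx) == song  # perfectly "compressed" to one integer
```

The catch is visible in the code: the database itself is as big as everything ever encoded, so nothing was actually saved overall.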
The same goes for predictors in lossless codecs. They represent prior knowledge that is assumed and built into the decoder. Once the data doesn't obey that knowledge, the codec falls apart, e.g. on white noise.
The best predictors achieve roughly 1:2. Just accept this. Anything beyond that is lossy.
mp3 and Ogg Vorbis are pretty good, compressing around 1:12. They get there by making assumptions about how the human ear works.