Use "whole program" and "link time optimization" features of gcc. The whole-program optimization enables the compiler to evaluate the entire firmware for optimization instead of just one code file at a time. This leads to better overall optimizations.