L1 instruction cache (L1i) misses are a major source of
performance degradation when the instruction footprint cannot be
captured in the L1i. Sequential prefetchers, such as a next-line prefetcher,
are a common way to mitigate this problem. However, this approach
falls short when the program frequently exercises complex control
flow. This observation has motivated researchers to put forward a
myriad of sophisticated proposals to address this problem. However,
we find that significant room for improvement remains; hence, in this
paper, we introduce a new instruction prefetcher to exploit the
available potential.
We address the L1i cache miss problem using a divide-and-conquer
approach. We carefully analyze why an instruction cache miss occurs
and how it can be eliminated, and divide instruction cache misses
into sequential and discontinuity misses. A sequential miss is a miss
to the block spatially right after the last accessed block; all
remaining misses are discontinuity misses.
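This classification can be sketched as follows; this is a minimal illustration, and the block size and function names are assumptions rather than details from the paper:

```python
BLOCK_SIZE = 64  # assumed L1i cache-block size in bytes

def classify_miss(miss_addr, last_block):
    """Classify an L1i miss as 'sequential' or 'discontinuity'.

    A sequential miss targets the block spatially right after the
    last accessed block; every other miss is a discontinuity.
    """
    miss_block = miss_addr // BLOCK_SIZE
    if miss_block == last_block + 1:
        return "sequential"
    return "discontinuity"
```

For example, a miss at `0x1040` following an access to the block containing `0x1000` is sequential, while a miss to a far-away block (e.g. after a taken branch) is a discontinuity.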
While sequential and discontinuity prefetchers have already been
proposed, we show in this paper that conventional implementations
of these prefetchers cannot adequately cover the misses because of
their shortcomings. Accordingly, we recommend an enhanced
implementation of each prefetcher. We find that, for a sequential
prefetcher, there is a trade-off between timeliness and accuracy.
Consequently, we propose the SN4L prefetcher, which attempts to
provide both accurate and timely prefetches. Moreover, a
conventional discontinuity prefetcher holds a single discontinuity
target per record, so its lookahead is limited to a single
discontinuity ahead of the execution stream, which limits its
effectiveness. On top of that, it stores an instruction block address
per record, which incurs considerable storage cost. We introduce the
Dis prefetcher
to address these shortcomings. Our proposal offers a 25% speedup
over a baseline without any prefetcher when given a 128 KB storage
budget, and outperforms the state-of-the-art prefetcher by 3% when
given a small 8 KB storage budget.
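The single-target limitation of a conventional discontinuity prefetcher can be illustrated with the following sketch; the table structure and method names are hypothetical, chosen only to show why each lookup can yield at most one discontinuity ahead:

```python
class ConventionalDisTable:
    """Hypothetical sketch of a conventional discontinuity prefetcher:
    one target block per trigger record, so a lookup can prefetch only
    a single discontinuity ahead of the execution stream."""

    def __init__(self):
        self.table = {}  # trigger block -> single target block

    def record(self, trigger_block, target_block):
        # A newly observed discontinuity overwrites the old target.
        self.table[trigger_block] = target_block

    def lookup(self, trigger_block):
        # Returns at most one block to prefetch.
        target = self.table.get(trigger_block)
        return [target] if target is not None else []
```

Because each record keeps one full block address and one target, both the lookahead and the storage efficiency are limited, which is what the Dis prefetcher is designed to improve.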