
LRU

Scratching the itch of Least-Recently Used algorithms and topologies.

LRU is an extension and modification of FIFO that is critical to cache replacement algorithms, sorting algorithms and compression algorithms (and by extension, crypto). It's important both to gather existing resources on this fundamental family of algorithms and to keep track of my own explorations and musings.

Let's start with a simple algorithm pointed to by Wikipedia that seems to have been widely used:

pseudo-LRU

two-way set associative - one bit

   indicates which line of the two has been referenced more recently


four-way set associative - three bits

   each bit represents one branch point in a binary decision tree; let 1
   represent that the left side has been referenced more recently than the
   right side, and 0 vice-versa

              are all 4 lines valid?
                   /       \
                 yes        no, use an invalid line
                  |
                  |
                  |
             bit_0 == 0?            state | replace      ref to | next state
              /       \             ------+--------      -------+-----------
             y         n             00x  |  line_0      line_0 |    11_
            /           \            01x  |  line_1      line_1 |    10_
     bit_1 == 0?    bit_2 == 0?      1x0  |  line_2      line_2 |    0_1
       /    \          /    \        1x1  |  line_3      line_3 |    0_0
      y      n        y      n
     /        \      /        \        ('x' means       ('_' means unchanged)
   line_0  line_1  line_2  line_3      don't care)

   (see Figure 3-7, p. 3-18, in Intel Embedded Pentium Processor Family Dev.
    Manual, 1998, http://www.intel.com/design/intarch/manuals/273204.htm)


note that there is a 6-bit encoding for true LRU for four-way set associative

  bit 0: bank[1] more recently used than bank[0]
  bit 1: bank[2] more recently used than bank[0]
  bit 2: bank[2] more recently used than bank[1]
  bit 3: bank[3] more recently used than bank[0]
  bit 4: bank[3] more recently used than bank[1]
  bit 5: bank[3] more recently used than bank[2]

  this results in 24 valid bit patterns within the 64 possible bit patterns
  (4! possible valid traces for bank references)

  e.g., a trace of 0 1 2 3, where 0 is LRU and 3 is MRU, is encoded as 111111

  you can implement a state machine with a 256x6 ROM (6-bit state encoding
  appended with a 2-bit bank reference input will yield a new 6-bit state),
  and you can implement an LRU bank indicator with a 64x2 ROM
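
To make that 6-bit encoding concrete, here is a small C rendition (mine, not from the manual): lru6_touch plays the role of the 256x6 next-state ROM, and lru6_victim the role of the 64x2 LRU-bank indicator ROM.

    /* 6-bit true LRU for 4 banks: one bit per pair of banks, set when the
       higher-numbered bank of the pair was used more recently. */
    #include <stdio.h>

    /* pairs[b] = {lo, hi}: bit b means "bank[hi] more recently used than bank[lo]" */
    static const unsigned pairs[6][2] = {
        {0,1}, {0,2}, {1,2}, {0,3}, {1,3}, {2,3}
    };

    /* Next-state function: when 'bank' is referenced it becomes the MRU,
       so every bit comparing it with another bank flips in its favour;
       bits not involving 'bank' are unchanged. */
    static unsigned lru6_touch(unsigned state, unsigned bank)
    {
        for (unsigned b = 0; b < 6; b++) {
            if (pairs[b][1] == bank) state |=  (1u << b);  /* bank is the 'hi' side */
            if (pairs[b][0] == bank) state &= ~(1u << b);  /* bank is the 'lo' side */
        }
        return state;
    }

    /* LRU indicator: the bank that every comparison votes against. */
    static unsigned lru6_victim(unsigned state)
    {
        for (unsigned bank = 0; bank < 4; bank++) {
            int is_lru = 1;
            for (unsigned b = 0; b < 6; b++) {
                if (pairs[b][0] == bank && !(state & (1u << b))) is_lru = 0;
                if (pairs[b][1] == bank &&  (state & (1u << b))) is_lru = 0;
            }
            if (is_lru) return bank;
        }
        return 0;  /* not reached for the 24 valid encodings */
    }

    int main(void)
    {
        unsigned s = 0;
        for (unsigned bank = 0; bank < 4; bank++)   /* trace 0 1 2 3 */
            s = lru6_touch(s, bank);
        printf("state = %u, LRU bank = %u\n", s, lru6_victim(s));
        return 0;
    }

Touching banks 0, 1, 2, 3 in order ends with state 63 (binary 111111) and LRU bank 0, matching the example trace above.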

Of course the 1998 link on the Intel website has long been broken, but this gives us a first approximation:

  • 2-way sets use 1 bit. It couldn't be simpler or easier, and the logic is truly minimal. Go for it every time you can :-)
  • 4-way sets are more complex. Only 3 bits are needed if pseudo-LRU is good enough for you (sketched in C right after this list), but true LRU now has to be distinguished and its state count grows as N!, so you'll need 6 bits and a 256×6-bit ROM.
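
And here is the matching C rendition of the 3-bit tree pseudo-LRU from the figure above (again just a software sketch; the "use an invalid line first" shortcut is left out, and the state is packed as bit_0 bit_1 bit_2). It simply follows the update and replacement tables.

    /* 4-way tree pseudo-LRU: bit_0 picks the pair, bit_1/bit_2 pick the line
       inside the pair.  A set bit means "the left side was used more recently". */
    #include <stdio.h>

    #define BIT0 4u   /* left pair vs right pair */
    #define BIT1 2u   /* line_0 vs line_1        */
    #define BIT2 1u   /* line_2 vs line_3        */

    /* Update on a reference: line_0 -> 11_, line_1 -> 10_, line_2 -> 0_1, line_3 -> 0_0 */
    static unsigned plru_touch(unsigned state, unsigned line)
    {
        if (line < 2)
            return (state & BIT2) | BIT0 | (line == 0 ? BIT1 : 0);
        else
            return (state & BIT1)        | (line == 2 ? BIT2 : 0);
    }

    /* Replacement choice: 00x -> line_0, 01x -> line_1, 1x0 -> line_2, 1x1 -> line_3 */
    static unsigned plru_victim(unsigned state)
    {
        if (!(state & BIT0))
            return (state & BIT1) ? 1 : 0;
        else
            return (state & BIT2) ? 3 : 2;
    }

    int main(void)
    {
        unsigned s = 0;
        unsigned trace[] = {0, 1, 2, 3};
        for (int i = 0; i < 4; i++)
            s = plru_touch(s, trace[i]);
        printf("victim = line_%u\n", plru_victim(s));   /* prints line_0 */
        return 0;
    }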

How can one build larger systems?

Wikipedia lists many strategies, but it is desirable to get "most" of the true-LRU benefits without the size, time and cost.

  • 4-LRU

    Yann Guidon / YGDES · 5 hours ago

    The permutation daemon is biting again, while I investigate a related caching subject. I am now considering the full LRU for 4 ways, using 5 bits arranged in 2-2-1 fields.

    Field A: 2 bits, fully encoded (4 values from 00 to 11), directly pointing to the least recently used way. It is handy because the value is directly available.

    Field B: the middle 2-bit field has only 3 valid values, 00 being forbidden (it marks a cleared set), so the number of the 2nd-to-least recently used way is obtained by XORing it with field A. The XOR is not strictly required, after all, but it is not too heavy on the circuit anyway.

    Field C: 1 bit, selects which one of the 2 remaining sets comes first. The logic gets complex here, but now I wonder if that part matters at all. Maybe this field can be simply skipped and removed, because it does not affect the LRU much.

    In fact what matters is the least recently used way, not the most recently used, so the relative order of the 2 most recently used ways can be skipped.

    All one has to do is compare a newly referenced set with fields A and B, and if there is no match, no update is needed.

    If there is a match, the matched field is replaced with the one on the left: this acts like a sort of shift register. And given fields A and B, some boolean logic is enough to restore one of the missing set numbers.

    This greatly reduces the size of the LRU logic, probably as small as the IBM 3tree system but with slightly better LRU behaviour. And there is no large lookup table of the kind suggested in the project intro.

    ***************************************

    What's crazy is: while writing it, it all made sense, and it was totally unexpected.

    Here is some pseudocode:

    A  : 2 bits \ stored LRU fields
    B  : 2 bits /
    B' : 2 bits, the decoded 2nd-to-LRU way
    S  : 2 bits (newly referenced set)
    A2, B2 : updated LRU fields
    
    B' = A xor B;
    
    If S != A and S != B' : do nothing.
    Else
       if S == A then
         A2 <= B
         B2 <= newB(A, B', S)
       else
         if S == B' then
           -- A is not modified
           B2 <= newB(A, B', S)
    

    So that's actually just a few XORs and MUXes.

    The secret sauce is in the function newB that creates the new value of B, "guesstimated" from A, B' and S. It is used in both branches, so we know that either S==A or S==B', but we also know that A != B'. So we can reuse B, because it is already the difference A xor B', which is never 0! (which could add a bias...)

    Here is the updated pseudo-VHDL code:

    B' := A xor B;                                 -- decode the 2nd-to-LRU way
    
    A2 <= B'       when  S == A               else A;   -- the old 2nd-LRU way becomes the new LRU
    B2 <= B xor A2 when (S == A) or (S == B') else B;   -- the old B is reused as the guessed 2nd-LRU
    
    update if (S == A) or (S == B')                -- no state change when S is already among the 2 MRU
    

    I'm still uncertain about the formula for B2 but you get the idea: it's quite simple. No weird LUT to precompute.
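
    To see the behaviour in software, here is a plain C sketch of the same two-field idea (the fields are kept decoded instead of XOR-packed, and the guess is just an arbitrary pick among the two remaining ways, standing in for newB):

    /* Approximate LRU for a 4-way set: only the LRU way (field A) and the
       2nd-to-LRU way (decoded B') are tracked; the order of the two most
       recently used ways is deliberately not stored. */
    #include <stdio.h>

    struct alru4 {
        unsigned lru;     /* field A : least recently used way            */
        unsigned second;  /* decoded B' : 2nd-to-least recently used way  */
    };

    /* Guess the new 2nd-LRU: either of the two ways that are neither the
       new LRU nor the way just referenced.  The pick is arbitrary; this is
       exactly where the scheme trades accuracy for size. */
    static unsigned guess_second(unsigned new_lru, unsigned referenced)
    {
        for (unsigned w = 0; w < 4; w++)
            if (w != new_lru && w != referenced)
                return w;
        return 0;  /* not reached */
    }

    static void alru4_touch(struct alru4 *st, unsigned s)
    {
        if (s == st->lru) {               /* the LRU way was just used      */
            st->lru = st->second;         /* old 2nd-LRU becomes the LRU    */
            st->second = guess_second(st->lru, s);
        } else if (s == st->second) {     /* the 2nd-LRU way was just used  */
            st->second = guess_second(st->lru, s);
        }
        /* otherwise s is already among the two MRU ways: nothing to update */
    }

    int main(void)
    {
        struct alru4 st = { 0, 1 };       /* arbitrary state after a reset  */
        unsigned trace[] = { 0, 2, 1, 3, 2 };
        for (unsigned i = 0; i < 5; i++) {
            alru4_touch(&st, trace[i]);
            printf("touch way %u -> victim would be way %u\n", trace[i], st.lru);
        }
        return 0;
    }

    On the trace 0 2 1 3 2 it ends up designating way 0, which happens to be the true LRU at that point.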


Discussions

Yann Guidon / YGDES wrote 6 hours ago:

5-way seems easy:

2-way is 1 bit (easy).

3-way (6 permutations) requires 1+2 bits, with the 2-bit field using only 3 codes.

4-way (24) adds another 2-bit field, the new one is fully used.

5-way would require another 3-bit field that is not fully used. But we notice that 3×5=15, which fits in 4 bits, so the new 5-valued field and the first 2-bit field (which uses only 3 codes) can be merged into a 4-bit field.

So the 5-way full LRU permutations fit in 1+2+4=7 bits, which makes sense since 5!=120=2^7 - 8.

Interesting.

The remaining 8 codes can be used to indicate reset status for example.
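
Concretely, one way to pack the 5-way order into those 1+2+4 bits is through its factorial-base (Lehmer) digits; here is a C sketch of that packing (the exact field assignment is just one possible choice):

    /* A 5-way LRU order is a permutation of {0..4}: 5! = 120 states.
       Its Lehmer digits have ranges 2, 3, 4 and 5: the 2-valued digit gets
       the 1-bit field, the 4-valued digit the 2-bit field, and L[2] (3 values)
       and L[0] (5 values) are merged as L[2]*5 + L[0] (15 values) into the
       4-bit field, leaving 8 of the 128 codes unused. */
    #include <stdio.h>

    /* order[0] is the LRU way, order[4] the MRU way */
    static unsigned pack7(const unsigned order[5])
    {
        unsigned L[5];                        /* Lehmer code: L[i] in 0..(4-i) */
        for (int i = 0; i < 5; i++) {
            L[i] = 0;
            for (int j = i + 1; j < 5; j++)
                if (order[j] < order[i]) L[i]++;
        }
        return (L[3] << 6) | (L[1] << 4) | (L[2] * 5 + L[0]);
    }

    /* Decode the 7-bit code back into the LRU order, to show it is lossless. */
    static void unpack7(unsigned code, unsigned order[5])
    {
        unsigned L[5], pool[5] = {0, 1, 2, 3, 4}, n = 5;
        L[3] = (code >> 6) & 1;
        L[1] = (code >> 4) & 3;
        L[2] = (code & 15) / 5;
        L[0] = (code & 15) % 5;
        L[4] = 0;
        for (int i = 0; i < 5; i++) {
            order[i] = pool[L[i]];            /* L[i]-th smallest remaining way */
            for (unsigned k = L[i]; k + 1 < n; k++)
                pool[k] = pool[k + 1];        /* remove it from the pool        */
            n--;
        }
    }

    int main(void)
    {
        unsigned order[5] = {3, 0, 4, 1, 2};  /* way 3 is LRU, way 2 is MRU */
        unsigned back[5];
        unsigned code = pack7(order);
        unpack7(code, back);
        printf("code = 0x%02x, decoded LRU..MRU =", code);
        for (int i = 0; i < 5; i++) printf(" %u", back[i]);
        printf("\n");
        return 0;
    }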

Yann Guidon / YGDES wrote 09/06/2021 at 06:10:

https://arxiv.org/abs/1512.00727:

[Submitted on 2 Dec 2015, last revised 3 Dec 2015 (this version, v2)]

TinyLFU: A Highly Efficient Cache Admission Policy

Gil Einziger, Roy Friedman, Ben Manes

This paper proposes to use a frequency based cache admission policy in order to boost the effectiveness of caches subject to skewed access distributions. Given a newly accessed item and an eviction candidate from the cache, our scheme decides, based on the recent access history, whether it is worth admitting the new item into the cache at the expense of the eviction candidate.

Realizing this concept is enabled through a novel approximate LFU structure called TinyLFU, which maintains an approximate representation of the access frequency of a large sample of recently accessed items. TinyLFU is very compact and light-weight as it builds upon Bloom filter theory.

We study the properties of TinyLFU through simulations of both synthetic workloads as well as multiple real traces from several sources. These simulations demonstrate the performance boost obtained by enhancing various replacement policies with the TinyLFU eviction policy. Also, a new combined replacement and eviction policy scheme nicknamed W-TinyLFU is presented. W-TinyLFU is demonstrated to obtain equal or better hit-ratios than other state of the art replacement policies on these traces. It is the only scheme to obtain such good results on all traces.

A much earlier and shorter version of this work appeared in the Euromicro PDP 2014 conference.
