A Translation lookaside buffer (TLB) is a CPU cache that memory management hardware uses to improve virtual address translation speed. It was the first cache introduced in processors. All current desktop and server processors (such as x86) use a TLB. A TLB has a fixed number of slots that contain page table entries, which map virtual addresses to physical addresses. It is typically a content-addressable memory (CAM), in which the search key is the virtual address and the search result is a physical address. If the requested address is present in the TLB, the CAM search yields a match quickly, after which the physical address can be used to access memory. This is called a TLB hit. If the requested address is not in the TLB, the translation proceeds by looking up the page table in a process called a page walk. The page walk is a high latency process, as it involves reading the contents of multiple memory locations and using them to compute the physical address. Furthermore, the page walk takes significantly longer if the translation tables are swapped out into secondary storage, which a few systems allow. After the physical address is determined, the virtual address to physical address mapping and the protection bits are entered in the TLB.
Contents[hide]
|
[edit] Overview
The TLB references physical memory addresses in its table. It may reside between the CPU and the CPU cache or between the CPU cache and primary storage memory. This depends on whether the cache uses physical or virtual addressing. If the cache is virtually addressed, requests are sent directly from the CPU to the cache, which then accesses the TLB as necessary. If the cache is physically addressed, the CPU does a TLB lookup on every memory operation, and the resulting physical address is sent to the cache. There are pros and cons to both implementations.
A common optimization for physically addressed caches is to perform the TLB lookup in parallel with the cache access. The low-order bits of any virtual address (e.g., in a virtual memory system having 4KB pages, the lower 12 bits of the virtual address) represent the offset of the desired address within the page, and thus they do not change in the virtual-to-physical translation. During a cache access, two steps are performed: an index is used to find an entry in the cache's data store, and then the tags for the cache line found are compared. If the cache is structured in such a way that it can be indexed using only the bits that do not change in translation, the cache can perform its "index" operation while the TLB translates the upper bits of the address. Then, the translated address from the TLB is passed to the cache. The cache performs a tag comparison to determine if this access was a hit or miss. It is possible to perform the TLB lookup in parallel with the cache access even if the cache must be indexed using some bits that may change upon address translation; see the address translation section in the cache article for more details about virtual addressing as it pertains to caches and TLBs.
[edit] Software TLB and hardware TLB
A software TLB or a software managed TLB, is a TLB that the operating system can manipulate. Instruction sets of CPUs that implement TLB have instructions that control TLB management functions, such as TLB flushing. The format of the TLB entry is also defined as a part of the ISA[1]. SPARC and MIPS provides examples of designs that implement such a TLB.
A hardware managed TLB, like on x86, is designed to operate invisible to the operating system. Filling the TLB with address translations and flushing the TLB are under the control of dedicated hardware. In fact, if the TLB were removed from the CPU (not that it is a good idea), the programs would exhibit no difference, except that the time for executing them would increase. Moreover, since it is invisible to the OS and the ISA, the format of the TLB entries can be defined as needed and this definition can change from CPU to CPU without causing loss of compatibility for the programs. The Itanium architecture provides an option of using either software or hardware managed TLBs[2].
[edit] Miss
Two schemes for handling TLB misses are commonly found in modern architectures:
- With hardware TLB management, the CPU itself walks the page tables (using the CR3 register on x86 for instance) to see if there is a valid page table entry for the specified virtual address. If an entry exists, it is brought into the TLB and the TLB access is retried: this time the access will hit, and the program can proceed normally. If the CPU finds no valid entry for the virtual address in the page tables, it raises a page fault exception, which the operating system must handle. Handling page faults usually involves bringing the requested data into physical memory, setting up a page table entry to map the faulting virtual address to the correct physical address, and resuming the program (see Page fault for more details.)
- With software-managed TLBs, a TLB miss generates a "TLB miss" exception, and the operating system must walk the page tables and perform the translation in software. The operating system then loads the translation into the TLB and restarts the program from the instruction that caused the TLB miss. As with hardware TLB management, if the OS finds no valid translation in the page tables, a page fault has occurred, and the OS must handle it accordingly. This is how TLB misses are handled on MIPS CPUs for instance[3].
[edit] Typical TLB
- Size: 8 - 4,096 entries
- Hit time: 0.5 - 1 clock cycle
- Miss penalty: 10 - 30 clock cycles
- Miss rate: 0.01% - 1%
Computer Organization And Design. Burlington, MA 01803, USA: Morgan Kaufmann Publishers. 2007. p. 523. ISBN 978-0-12-370606-5.
If a TLB hit takes 1 clock cycle, a miss takes 30 clock cycles, and the miss rate is 1%, the effective memory cycle rate is an average of
(1.30 clock cycles per memory access).
In a Harvard architecture or hybrid thereof, a separate virtual address space may exist for instruction and data caching. This can lead to distinct TLB buffers for each of the caches (instructions, data, or unified TLB).
[edit] Context switch
On a context switch, some TLB entries can become invalid, since for example the previously running process had access to a page, but the process to run does not. The simplest strategy to deal with this is to completely flush the TLB. Newer CPUs have more efficient strategies; for example in the Alpha 21264, each TLB entry is tagged with an "address space number" (ASN), and only TLB entries with an ASN matching the current task are considered valid. Another example in the Intel Pentium Pro, the page global enable (PGE) flag in the register CR4 and the global (G) flag of a page-directory or page-table entry can be used to prevent frequently used pages from being automatically invalidated in the TLBs on a task switch or a load of register CR3.
While selective flushing of the TLB is an option in software managed TLBs, till now the only option in hardware TLBs has been the complete flushing of the TLB on a context switch.
[edit] Virtualization and x86 TLB
With the advent of virtualization for server consolidation, a lot of effort has gone into making the x86 architecture easier to virtualize and to ensure better performance of virtual machines on x86 hardware[4][5]. In a long list of such changes to the x86 architecture, the TLB is the latest.
Normally, the entries in the x86 TLBs are not associated with any address space. Hence, every time there is a change in address space, such as a context switch, the entire TLB has to be flushed. Maintaining a tag which associates each TLB entry with an address space in software and comparing this tag during TLB lookup and TLB flush is very expensive, especially since the x86 TLB is designed to operate with very low latency and completely in hardware. In 2008, both Intel (Nehalem)[6] and AMD (SVM)[7] have introduced tags as part of the TLB entry and dedicated hardware which checks the tag during lookup. Even though these are not fully exploited, it is envisioned that in the future, these tags will identify the address space to which every TLB entry belongs. Thus a context switch will not result in the flushing of the TLB - but just changing the tag of the current address space to the tag of the address space of the new task.
No comments:
Post a Comment