How important is L3 cache for AMD processors?

Indeed, it makes sense to equip multi-core processors with dedicated memory that all available cores share. In this role, a fast L3 cache can significantly speed up access to the data that is requested most often, so that the cores, whenever possible, do not have to go to the slow main memory (RAM).

At least in theory. Recently AMD announced the Athlon II X4 processor, which is essentially a Phenom II X4 without the L3 cache, hinting that it is not so necessary. We decided to compare two processors (with and without L3 cache) directly to see how the cache affects performance.


How does the cache work?

Before we dive into the tests, it's important to understand some basics. The principle of a cache is quite simple: it buffers data as close as possible to the processor's computing cores in order to reduce the number of CPU requests to more distant and slower memory. On modern desktop platforms the cache hierarchy comprises as many as three levels that are consulted before main memory is accessed. Moreover, the second- and, in particular, third-level caches serve more than data buffering: their purpose is also to prevent overloading the CPU bus when the cores need to exchange information.

Hits and misses

The effectiveness of a cache architecture is measured by its hit rate. Requests that can be satisfied from the cache count as hits. If the cache does not contain the required data, the request is passed further along the memory pipeline and a miss is counted. Misses, of course, mean it takes more time to get the information, producing "bubbles" (idle slots) and delays in the computational pipeline. Hits, on the other hand, allow the processor to maintain maximum performance.
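As a rough illustration of hit counting, here is a tiny simulation of a direct-mapped cache. All sizes and the access pattern are invented for the example, not taken from any real CPU:

```python
# Illustrative sketch: count hits and misses for a tiny
# direct-mapped cache of 4 lines, 16 bytes per line.
NUM_LINES = 4
LINE_SIZE = 16

cache = [None] * NUM_LINES  # each slot holds the tag of the line it caches
hits = misses = 0

def access(addr):
    global hits, misses
    line_no = addr // LINE_SIZE      # which memory line the address belongs to
    index = line_no % NUM_LINES      # the one slot a direct-mapped cache may use
    tag = line_no // NUM_LINES       # identifies which line currently occupies it
    if cache[index] == tag:
        hits += 1
    else:
        misses += 1                  # miss: fetch the line, evicting the old one
        cache[index] = tag

# Sequential 4-byte reads show spatial locality: one miss per
# 16-byte line, then three hits on the rest of the line.
for addr in range(0, 128, 4):
    access(addr)

print(hits, misses)  # 24 8  -> a 75% hit rate
```

Real caches work on the same principle, only with far more lines and smarter placement.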

Cache writing, exclusivity, coherence

Replacement policies dictate how cache space is freed up for new entries. Since data written to the cache must sooner or later appear in main memory as well, systems can write it to memory at the same time as to the cache (write-through), or they can mark the cached area as "dirty" (write-back) and write it to memory only when the line is evicted from the cache.
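The difference between the two policies can be sketched in a few lines of Python. The one-line "cache" and the dict standing in for main memory are illustrative simplifications:

```python
# Sketch of the two write policies over a dict that stands in for RAM.
memory = {0x10: 0}

class WriteThrough:
    def __init__(self):
        self.line = {}
    def write(self, addr, value):
        self.line[addr] = value
        memory[addr] = value          # propagate to memory immediately

class WriteBack:
    def __init__(self):
        self.line = {}
        self.dirty = False
    def write(self, addr, value):
        self.line[addr] = value
        self.dirty = True             # memory is now stale: the line is "dirty"
    def evict(self):
        if self.dirty:                # memory is updated only on eviction
            memory.update(self.line)
            self.dirty = False
        self.line.clear()

wb = WriteBack()
wb.write(0x10, 42)
print(memory[0x10])  # 0 -- main memory still lags behind the cache
wb.evict()
print(memory[0x10])  # 42 -- the dirty line was flushed on eviction
```

Write-back saves memory-bus traffic when the same line is written repeatedly, at the cost of the extra "dirty" bookkeeping shown above.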

Data in several cache levels can be stored exclusively, that is, without redundancy: you will then not find identical data lines in two different cache levels. Alternatively, caches can work inclusively, meaning the lower cache levels are guaranteed to contain the data present in the upper levels (those closer to the processor core). AMD's Phenom uses an exclusive L3 cache, while Intel follows an inclusive cache strategy. Coherence protocols keep data consistent and up to date across cores, cache levels, and even multiple processors.

Cache size

A larger cache can hold more data, but it tends to increase latency. In addition, a large cache consumes a considerable number of processor transistors, so it is important to strike a balance between the transistor "budget", die size, power consumption, and performance/latency.

Associativity

Entries in RAM can be direct-mapped to the cache, meaning there is only one position in the cache where a copy of a given piece of RAM data may reside, or the cache can be n-way associative, meaning there are n possible locations where that data might be stored. A higher degree of associativity (up to fully associative caches) provides the greatest caching flexibility, because existing data in the cache need not be overwritten. In other words, high n-way associativity tends to yield a higher hit rate, but it increases latency, because it takes more time to check all of those ways for a hit. As a rule, the highest degree of associativity makes sense for the last level of cache, since that is where the maximum capacity is available, and failing to find data there sends the processor off to slow RAM.
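The placement rule for an n-way set-associative cache can be written down directly. The line size, set count, and way count below are illustrative, not those of any particular processor:

```python
# Where an address may live in an n-way set-associative cache:
# the address fixes exactly one set, but the line may occupy any
# of the WAYS slots inside that set, so a lookup compares only
# WAYS tags instead of scanning the whole cache.
LINE_SIZE = 64
NUM_SETS = 8
WAYS = 4  # 4-way set-associative

def placement(addr):
    line_no = addr // LINE_SIZE
    set_index = line_no % NUM_SETS   # one fixed set per address...
    tag = line_no // NUM_SETS        # ...tags tell apart the lines sharing it
    return set_index, tag

print(placement(0))       # (0, 0)
print(placement(64 * 8))  # (0, 1) -- same set, distinguished by tag
```

With WAYS = 1 this degenerates into a direct-mapped cache; with NUM_SETS = 1 it becomes fully associative, which is why lookup cost grows with associativity.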

To give a few examples: the Core i5 and i7 use 32 KB of L1 cache with 8-way associativity for data and 32 KB of L1 cache with 4-way associativity for instructions. Intel evidently wants instructions to be available more quickly, while the L1 data cache is tuned for the maximum hit rate. The L2 cache of these Intel processors has 8-way associativity, and Intel's L3 cache goes further still, implementing 16-way associativity to maximize hits.

AMD, however, pursues a different strategy with the Phenom II X4 processors, which use an L1 cache with 2-way associativity to reduce latency. To compensate for the resulting misses, the cache capacity was doubled: 64 KB for data and 64 KB for instructions. The L2 cache has 8-way associativity, like the Intel design, but AMD's L3 cache works with 48-way associativity. The choice of one cache architecture over another cannot be judged without considering the entire CPU architecture, though. What matters in practice are test results, and our goal was precisely a practical test of this whole complex multi-level caching structure.

Every modern processor has a dedicated cache that stores the processor's instructions and data, ready for use almost instantly. This level is commonly referred to as the level-one, or L1, cache, and it was first introduced with the 486DX processors. AMD processors have recently standardized on 64 KB of L1 cache per core (64 KB for data and 64 KB for instructions), while Intel processors use 32 KB of L1 cache per core (likewise separate for data and for instructions).

The first-level cache first appeared on the 486DX processors, after which it became a standard feature of all modern CPUs.

The second-level (L2) cache appeared on all processors after the release of the Pentium III, although its first implementation put it in the processor package, though not on the die, with the Pentium Pro. Modern processors are equipped with up to 6 MB of on-chip L2 cache. As a rule, such a volume is shared between two cores, as on an Intel Core 2 Duo, for example. Regular L2 configurations provide 512 KB or 1 MB of cache per core; processors with a smaller L2 cache tend to sit in the lower price tier. Below is a diagram of early L2 cache implementations.

The Pentium Pro had its L2 cache in the processor package. Subsequent generations of the Pentium III and Athlon implemented the L2 cache with separate SRAM chips, which was very common at the time (1998-1999).

The subsequent shrink of process technology to 180 nm finally allowed manufacturers to integrate the L2 cache onto the processor die.


Early dual-core processors simply reused existing designs, installing two dies in one package. AMD introduced a dual-core processor on a monolithic die, adding a memory controller and a switch, while Intel simply assembled two single-core dies in one package for its first dual-core processor.


The L2 cache was first shared between two computing cores on the Core 2 Duo processors. AMD went further and built its first quad-core Phenom from scratch, while Intel again used a pair of dies, this time two dual-core Core 2 dies, for its first quad-core processor to keep costs down.

L3 caches have existed since the days of the Alpha 21165 (96 KB, introduced in 1995) and the IBM Power4 (256 KB, 2001). In x86-based architectures, however, the L3 cache first appeared with the Intel Itanium 2, Pentium 4 Extreme (Gallatin, both in 2003), and Xeon MP (2006).

The first implementations simply provided another level in the cache hierarchy, whereas modern architectures use the L3 cache as a large, shared buffer for data exchange between the cores of multi-core processors. This is also emphasized by its high degree of associativity: it is better to spend a little longer looking for data in the cache than to end up with several cores making very slow accesses to main RAM. AMD first introduced an L3 cache on a desktop processor with the already mentioned Phenom line. The 65 nm Phenom X4 contained 2 MB of shared L3 cache, while the current 45 nm Phenom II X4 has 6 MB. The Intel Core i7 and i5 processors use 8 MB of L3 cache.

Modern quad-core processors have dedicated L1 and L2 caches for each core, plus a large L3 cache shared by all cores. The shared L3 cache also lets the cores exchange data that they are working on in parallel.


When performing various tasks, your computer's processor receives the necessary blocks of information from RAM. Having processed them, the CPU writes the results of its calculations back to memory and fetches subsequent blocks of data. This continues until the task is complete.

These processes happen at very high speed, yet even the fastest RAM is significantly slower than any modest processor. Every action, whether writing information to RAM or reading from it, takes a long time by the CPU's standards: the speed of RAM is roughly ten times lower than the speed of the processor.

Despite this difference in speed, the PC's processor does not sit idle waiting for RAM to deliver and accept data. The processor keeps working, and it owes this to its built-in cache memory.

Cache is a special kind of RAM. The processor uses cache memory to store those copies of information from the computer's main RAM that are likely to be accessed in the near future.

In essence, the cache acts as a high-speed memory buffer that stores information the processor may need, letting the processor receive that data roughly ten times faster than reading it from RAM.
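A quick back-of-the-envelope calculation shows why this buffering pays off. The latencies below are made-up round numbers, not measurements, chosen only to match the "ten times slower" ratio mentioned above:

```python
# Average memory access time: misses pay the cache lookup AND the trip to RAM.
cache_latency = 4     # cycles to read from the cache (illustrative)
ram_latency = 40      # cycles to read from RAM, about ten times slower

def average_access_time(hit_rate):
    return cache_latency + (1 - hit_rate) * ram_latency

print(average_access_time(0.75))  # 14.0 cycles
print(average_access_time(0.5))   # 24.0 cycles
```

Even with these toy numbers, raising the hit rate from 50% to 75% nearly halves the average access time, which is exactly why cache design revolves around maximizing hits.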

The main difference between a cache and a regular buffer is the built-in logic. A buffer stores arbitrary data, usually handled on a first-in, first-out (FIFO) or first-in, last-out (LIFO) basis. A cache holds data that is likely to be accessed in the near future. Thanks to this "smart" behavior, the processor can run at full speed rather than wait for data to be retrieved from slower RAM.

Main types and levels of cache: L1, L2, L3

Cache memory is built from static random access memory (SRAM) chips, either installed on the system board or built into the processor. Compared to other memory types, static memory can operate at very high speed.

Cache speed depends on the size of a given chip: the larger the chip, the harder it is to make it run fast. Because of this, the processor's cache is built as several small blocks called levels. The most common arrangement today is the three-level system L1, L2, L3:

The first-level (L1) cache is the smallest (only a few tens of kilobytes) but the fastest and the most important. It holds the data the processor uses most frequently and operates without wait states. Typically there are as many L1 caches as processor cores, with each core accessing only its own.

The L2 cache is slower than L1 but larger, with a capacity measured in hundreds of kilobytes. It temporarily stores important data that is less likely to be accessed than the data held in the L1 cache.

The third-level (L3) cache is the largest of the three (it can reach tens of megabytes) but also the slowest, though still significantly faster than RAM. The L3 cache is shared by all processor cores. It temporarily stores important data with a somewhat lower probability of access than the data in the first two levels, and it also enables the processor cores to interact with each other.

Some processor models have only two cache levels, in which case the L2 combines the functions of L2 and L3.

When is a large cache useful?

You will feel a significant effect from a large cache in archiving programs, 3D games, and video processing and encoding. In relatively "light" programs and applications (office software, media players, etc.) the difference is practically unnoticeable.

All users are familiar with computer components such as the processor, which is responsible for processing data, and random access memory (RAM), which is responsible for storing it. But probably not everyone knows there is also a processor cache (CPU cache), that is, the processor's own ultra-fast RAM.

What is the reason that prompted computer developers to use special memory for the processor? Isn't RAM enough for a computer?

Indeed, for a long time personal computers did without any cache memory at all. But as you know, the processor is the fastest device in a personal computer, and its speed has grown with each new CPU generation; today it is measured in billions of operations per second. Standard RAM, by contrast, has not improved its performance nearly as much over the course of its evolution.

Generally speaking, there are two main technologies for memory chips: static memory and dynamic memory. Without delving into their internal structure, suffice it to say that static memory, unlike dynamic memory, does not require refreshing; moreover, static memory uses 4-8 transistors per bit of information, while dynamic memory uses 1-2. Accordingly, dynamic memory is much cheaper than static memory, but also much slower. Today's RAM chips are built on dynamic memory.

Approximate evolution of the ratio of the speed of processors and RAM:

Thus, if the processor fetched information from RAM all the time, it would have to wait for the slow dynamic memory and would be idle constantly. If, on the other hand, static memory were used as RAM, the cost of the computer would increase several times over.

That is why a reasonable compromise was devised: the bulk of RAM remained dynamic, while the processor got its own fast cache built from static memory chips. Its capacity is relatively small - the L2 cache, for example, is only a few megabytes. Remember, though, that the entire RAM of the first IBM PC computers was less than 1 MB.

In addition, caching makes sense because the different applications residing in RAM load the processor differently, and as a result there is plenty of data that deserves priority handling over the rest.

History of the cache

Strictly speaking, before cache memory made its way into personal computers, it had already been used successfully in supercomputers for decades.

Cache memory first appeared in PCs based on the i80386 processor, at a mere 16 KB. Today's processors use several cache levels, from the first (the fastest and smallest, usually 128 KB) to the third (the slowest and largest, up to tens of megabytes).

At first, the processor's cache memory sat on a separate external chip. Over time, however, the bus between the cache and the processor became a bottleneck that slowed down data exchange. In modern microprocessors, both the first and second cache levels are located on the processor die itself.

For a long time processors had only two cache levels; a third-level cache, shared by all processor cores, first appeared in the Intel Itanium CPU. Processors with a four-level cache are under development as well.

Architectures and principles of cache operation

Two main ways of organizing cache memory are known today, originating in the first theoretical work in cybernetics: the Princeton and Harvard architectures. The Princeton architecture uses a single memory space for data and instructions, whereas the Harvard architecture keeps them separate. Most x86 personal-computer processors use the separate type of cache. In addition, modern processors gained a third kind of cache, the translation lookaside buffer, which speeds up the translation of the operating system's virtual-memory addresses into physical memory addresses.

In simplified form, the interaction between the cache and the processor works as follows. First, the fastest cache, level one, is checked for the information the processor needs; then the second-level cache, and so on. If the required information is not found at any level, this is called a cache miss, and the processor has to fetch the data from RAM or even from external memory (the hard drive).
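This lookup order can be modeled in a few lines. The contents of each level and the latencies are invented for illustration:

```python
# Simplified model of the search order: L1 -> L2 -> L3 -> RAM,
# stopping at the first level that holds the requested address.
levels = [
    ("L1", {0xA0}, 4),
    ("L2", {0xA0, 0xB0}, 12),
    ("L3", {0xA0, 0xB0, 0xC0}, 40),
]
RAM_LATENCY = 200

def lookup(addr):
    cost = 0
    for name, contents, latency in levels:
        cost += latency              # every probed level adds its latency
        if addr in contents:
            return name, cost        # hit: stop searching
    return "RAM", cost + RAM_LATENCY # missed at every level

print(lookup(0xB0))  # ('L2', 16) -- found after probing L1 and L2
print(lookup(0xD0))  # ('RAM', 256) -- missed everywhere
```

Note that the model also captures a real cost of deep hierarchies: a full miss pays the probe time of every level on the way down before RAM is finally consulted.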

The order in which the processor searches for information in memory:


A special controller manages the cache and its interaction with the processor's computing units and with RAM.

Scheme of organizing the interaction of the processor core, cache and RAM:

The cache controller is the key link between the processor, RAM and cache.

It should be noted that data caching is a complex process relying on many technologies and mathematical algorithms. Among the basic concepts used in caching are the cache write methods and the cache associativity architecture.

Cache Write Methods

There are two main methods for writing information to the cache:

  1. Write-back - data is written first to the cache and then, when certain conditions are met, to RAM.
  2. Write-through - data is written to RAM and to the cache simultaneously.

Cache Associativity Architecture

The cache associativity architecture defines the way in which data from RAM is mapped to the cache. There are the following main variants of caching associativity architecture:

  1. Direct-mapped cache - a specific area of the cache is responsible for a specific area of RAM
  2. Fully associative cache - any area of the cache can be associated with any area of RAM
  3. Set-associative (mixed) cache - each area of RAM maps to one of a small set of cache locations
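Option 1 can be made concrete by splitting an address into its tag, index, and offset fields. The 32-byte line and 128 slots below are illustrative values:

```python
# Sketch of direct mapping: the address itself decides the slot.
LINE_SIZE = 32    # bytes per cache line (illustrative)
NUM_SLOTS = 128   # lines in the cache (illustrative)

def split_address(addr):
    offset = addr % LINE_SIZE                 # byte within the line
    index = (addr // LINE_SIZE) % NUM_SLOTS   # the one slot this address maps to
    tag = addr // (LINE_SIZE * NUM_SLOTS)     # tells apart RAM areas sharing the slot
    return tag, index, offset

# Two addresses exactly NUM_SLOTS * LINE_SIZE = 4096 bytes apart
# collide in slot 0 and keep evicting each other:
print(split_address(0))     # (0, 0, 0)
print(split_address(4096))  # (1, 0, 0) -- same index, different tag
```

Such collisions are precisely what the fully associative and set-associative variants are designed to reduce.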

Different associativity architectures can be used at different cache levels. Direct-mapped caching gives the fastest lookups, since only one location has to be checked; a fully associative cache, in turn, suffers the fewest misses.

Conclusion

In this article you have become acquainted with the concept of cache memory, its architecture, and caching methods, and learned how it affects the performance of a modern computer. Cache memory can significantly optimize the processor's operation, reduce its idle time, and consequently increase the performance of the entire system.

All processors since the late 1990s have had an internal cache (or simply cache). A cache is high-speed memory that holds the instructions and data being processed directly by the processor.

Modern processors have built-in cache memory of two levels: the first (L1) and the second (L2). The processor works somewhat faster with the contents of the L1 cache, while the L2 cache is usually slightly larger. The cache is accessed without wait states: the first-level (on-chip) cache runs at the same frequency as the processor.

This means that if the data needed by the processor is in the cache, then there is no delay in processing. Otherwise, the processor must get data from the main memory, which significantly reduces system performance.

In order to qualitatively understand the principle of operation of cache memory of both levels, let's consider a domestic situation using an example.

You come to a cafe for lunch every day at the same time and always sit at the same table, and you always order the same standard three-course set.

The waiter runs to the kitchen, the chef puts the dishes on a tray, and your order is brought to you. On the third day, say, the waiter, to avoid yet another trip to the kitchen, meets you at the appointed time with a ready hot lunch on a tray.

You do not wait for the order and save a lot of time. The tray with your dishes is the first-level cache. But on the fourth day you suddenly want to add another dish, say a dessert.

Although a tray with your order was already waiting at the appointed time, the waiter still had to run to the kitchen for the dessert.

On the fifth day, it is again the three-course menu. On the sixth, a dessert again, but a different one from before. And the waiter, not knowing what dessert you will want (or whether you will order one at all), takes the next step: he places a cabinet with several kinds of dessert next to your table.

Now if you feel the urge, everything is at hand and there is no need to run to the kitchen. The dessert cabinet is the second-level cache.

The size of the L1 cache (from 16 to 128 KB) and of the L2 cache (from 64 KB to 512 KB, up to 4 MB in the Pentium III Xeon and AMD Opteron) significantly affects processor performance.

Intel Pentium III processors and the Celerons based on them have 32 KB of L1 cache. The Intel Pentium 4, as well as the Celeron and Xeon versions based on it, has only 20 KB (an 8 KB data cache plus a 12 KB trace cache for decoded instructions). AMD's Duron, Athlon (including XP/MP) and Opteron processors, as well as the VIA C3, contain 128 KB of L1 cache.

Modern dual-core processors have a separate first-level cache for each core, so cache descriptions sometimes show a figure like 128x2. This means that each processor core has 128 KB of L1 cache.

L1 cache size is important for high performance in the most common tasks (office applications, games, most server applications, etc.). Its effect is especially pronounced in streaming computations, such as video processing.

This is one of the reasons the Pentium 4 is relatively inefficient in most common applications (though this is compensated by its high clock speed). The L1 cache always exchanges information with the processor core at the processor's internal frequency.

The L2 cache, by contrast, runs at different frequencies (and hence speeds) in different processor models. Beginning with the Intel Pentium II, many processors used an L2 cache running at half the processor's internal frequency.

This solution was used in older Intel Pentium III processors (up to 550 MHz) and older AMD Athlons (in some of which the L2 cache ran at a third of the core frequency). L2 cache size also differs between processors.

Older and some newer Intel Pentium III processors have 512 KB of L2 cache; other Pentium IIIs have 256 KB. The Pentium III-based Intel Celeron shipped with 128 or 256 KB of L2 cache, while the Pentium 4-based Celeron came with only 128 KB. Various Xeon variants of the Intel Pentium 4 have up to 4 MB of L2 cache.

Newer Pentium 4 processors (some 2000 MHz series and everything above that frequency) have 512 KB of L2 cache; the remaining Pentium 4s have 256 KB. Xeon processors (based on the Pentium 4) have 256 or 512 KB of L2 cache.

Some of them also have a third-level (L3) cache. The integrated L3 cache, combined with a fast system bus, forms a high-speed data channel to system memory.

As a rule, only processors for server solutions or special models of "desktop" processors are equipped with an L3 cache. It is found, for example, in processor lines such as the Xeon DP, Itanium 2, and Xeon MP.

The AMD Duron processor has 128 KB of L1 cache and 64 KB of L2. Athlon processors (except the oldest ones), the Athlon MP, and most Athlon XP variants have 128 KB of L1 cache and 256 KB of L2, while the latest Athlon XPs (2500+, 2800+, 3000+ and above) have 512 KB of L2 cache. The AMD Opteron contains 1 MB of L2 cache.

The latest models of the Intel Pentium D, Pentium M, and Core 2 Duo lines ship with up to 6 MB of L2 cache, and the Core 2 Quad with up to 12 MB.

The latest Intel Core i7 processor, as of this writing, has 64 KB of L1 cache for each of its 4 cores and 256 KB of L2 per core. In addition to the first- and second-level caches, the processor also has an 8 MB third-level cache shared by all cores.

For processors that may come with different L2 cache sizes (or, in the case of the Intel Xeon MP, L3 sizes) within the same model, the size should be specified at the point of sale (the processor's price, of course, depends on it). For processors sold in "boxed" (retail) packaging, the cache size is usually indicated on the box.

For typical user tasks (including games), L2 cache speed matters more than its size; for server tasks, the opposite is true: volume matters more. The most productive servers, especially those with several gigabytes of RAM, need the largest and fastest L2 caches available.

The Xeon versions of the Pentium III remain unsurpassed in these parameters. (The Xeon MP is still more productive in server tasks than the Pentium III Xeon, thanks to the higher clock frequency of the processor itself and of the memory bus.) A large on-die cache also minimizes the wait states that arise during data processing, and the decisive role here belongs to the second-level cache located on the processor chip.

What is the dirtiest place on a computer? The Recycle Bin, you think? The user folders? The cooling system? Wrong! The dirtiest place is the cache - after all, it constantly has to be cleaned!

In fact, a computer has many caches, and they serve not as garbage dumps but as accelerators for hardware and applications. Where, then, does their reputation as the "system garbage chute" come from? Let's look at what a cache is, what kinds exist, how it works, and why it needs to be cleared from time to time.

The concept and types of cache memory

Cache memory is a special store of frequently used data, access to which is tens, hundreds, or thousands of times faster than access to RAM or other storage media.

Applications (web browsers, audio and video players, database editors, etc.), operating system components (the thumbnail cache, the DNS cache) and hardware (the CPU's L1-L3 caches, the GPU's frame-buffer chip, drive buffers) all have their own cache memory. It is implemented in different ways: in software or in hardware.

  • A software cache is simply a separate folder or file into which, for example, pictures, menus, scripts, multimedia and other content of visited sites is downloaded. This is the folder the browser checks first when you open a web page again: pulling some of the content from local storage speeds up page loading.

  • In hard drives, the cache is a separate RAM chip with a capacity of 1-256 MB located on the electronics board. It receives data read from the magnetic platters but not yet loaded into RAM, as well as the data the operating system requests most often.

  • A modern CPU contains 2-3 main levels of cache memory (also called scratchpad or ultra-fast memory) implemented as hardware modules on the same die. The fastest and smallest (32-64 KB) is the level-one (L1) cache; it runs at the same frequency as the processor. L2 occupies the middle position in speed and capacity (from 128 KB to 12 MB). L3 is the slowest and most capacious (up to 40 MB) and is absent on some models. L3 is slow only relative to its faster siblings; it is still many times faster than the most productive RAM.

The processor's scratchpad memory stores constantly used data pumped in from RAM, along with machine-code instructions. The larger it is, the faster the processor tends to be.

Today, three levels of caching are no longer the limit. With the Sandy Bridge architecture, Intel added an L0 cache to its products (for storing decoded micro-ops), and the highest-performance CPUs also have a fourth-level cache implemented as a separate chip.

Schematically, the interaction of cache L0-L3 levels looks like this (for example, Intel Xeon):

Human language about how it all works

To understand how cache memory works, imagine a person at a desk. The folders and documents he uses all the time lie on the desk (the cache); to reach them, he only has to stretch out a hand.

Papers he needs less often are stored nearby on shelves (RAM); to get them, he has to stand up and walk a few meters. And whatever he is not currently working on has been sent to the archive (written to the hard disk).

The wider the desk, the more documents fit on it, and the faster the worker can reach more information (the larger the cache capacity, the faster, in theory, a program or device runs).

Sometimes he makes mistakes: he keeps papers with incorrect information on the desk and uses them in his work, and the quality of his work suffers (cache errors lead to software and hardware failures). To fix the situation, the worker must throw out the erroneous documents and put correct ones in their place (clear the cache).

The desk has a limited area (cache memory is limited). It can sometimes be enlarged, say by adding a second desk, and sometimes not (the cache size can be increased if the program allows it; a hardware cache cannot be changed, since it is fixed in silicon).

Another way to speed up access to more documents than the desk can hold is to hire an assistant who fetches papers from the shelf for the worker (the operating system can allocate part of the unused RAM for caching device data). It is still slower, though, than taking them straight off the desk.

The documents at hand should be relevant to the current tasks; this is the worker's own responsibility, and he needs to tidy the papers regularly (evicting stale data from the cache falls "on the shoulders" of the applications that use it; some programs have an automatic cache-clearing function).

If the worker forgets to keep the workplace in order and the documentation up to date, he can draw up a desk-cleaning schedule and use it as a reminder, or, as a last resort, entrust the task to the assistant (if an application that depends on a cache becomes slower or keeps loading outdated data, use scheduled cache-cleaning tools or clear it manually every few days).

We encounter "caching" everywhere in everyday life: buying groceries for the days ahead, doing various things in passing, along the way, and so on - anything that spares us needless fuss and extra motion, streamlines life, and makes work easier. The computer does the same. In short, without a cache it would run hundreds or thousands of times slower, and we would not like that.


What is a cache, why is it needed and how does it work updated: February 25, 2017 by: Johnny Mnemonic