Memory Speed: The Great Ignored Performance Factor

When you read through stacks of PC reviews, there's always some benchmark they use to compare the performance of each machine. Sometimes you'll see specific statistics on CPU speed, or video speed, but what you never see is a good measurement of the memory access speed. Memory gets ignored, and this is a mistake. Making sure your memory subsystem is optimized correctly is the first and most important thing to do if your PC isn't running quickly enough.

Performance fact: performance in every area (CPU, disk, video) depends on quick memory access.

Bottlenecks

Central to the ideas behind speeding up any system is the notion of a bottleneck. Just as the neck of a bottle slows the rate at which you can pour fluid out of it, one part of your system will always limit the rate at which activity can happen. There is no question that something in your computer is the slowest piece, and it's holding back the rate at which work happens. Until you figure out what it is, spending time optimizing is useless; you're just speeding up things that aren't the limiting factor.

Performance fact: changes to things other than the cause of your system bottleneck are a wasted effort.

In the days of older PC systems, you never knew what the bottleneck was going to be. Usually, it was either the processor not going fast enough or the disk drives not keeping up. Intel has a paper called "Performance Factors in a Computer" sitting on their home page. It claims that the processor choice is responsible for 54% of the "speed" of a computer running Windows, while 25% is attributable to the memory (video gets 12% and the disk 9%). This is just plain wrong for today's computers (Intel's test system was a Pentium 60 with 8MB of RAM, totally obsolete at this point). Nowadays, processors are far faster than they used to be, and raw disk speed isn't as much of a factor. Everything is a bit different because of the widespread use of caching.

Caching

When you're using your computer, there are things that you access frequently and repetitively. For example, if you're running a DOS-based system, you are constantly looking at the computer's disk contents through structures like the FAT (file allocation table). If you need to read something off the disk, you first need to traverse the disk's table of contents to locate where on the disk it is stored; only then can you read the appropriate sector. Recognizing that the disk drive is spending much of its time reading the same information over and over leads to the idea of caching. A cache is a portion of memory that stores those frequently accessed pieces of information. Since reading memory is far faster than waiting for the drive to spin around until it locates what you want, the information that is in the cache can be fed back to the processor with significantly less delay. Less delay means better performance. Caching theory says that over 90% of disk access is typically to things that were just looked at recently. If you can make 90% of your disk access disappear, you should be flying, right? That's the idea.
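To make the caching idea concrete, here's a minimal sketch of a disk sector cache. It's an illustration under stated assumptions, not how SMARTDRV or any real cache is built: the least-recently-used (LRU) eviction policy, the `SectorCache` class, and the `fake_disk_read` function are all hypothetical names and choices made for this example.

```python
from collections import OrderedDict


class SectorCache:
    """Tiny LRU cache for disk sectors (illustrative only)."""

    def __init__(self, capacity, read_from_disk):
        self.capacity = capacity              # number of sectors held in memory
        self.read_from_disk = read_from_disk  # hypothetical slow read function
        self.sectors = OrderedDict()          # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def read(self, sector):
        if sector in self.sectors:
            # Served from memory: no waiting on the drive.
            self.hits += 1
            self.sectors.move_to_end(sector)  # mark as most recently used
            return self.sectors[sector]
        # Not cached: go to the (slow) disk and remember the result.
        self.misses += 1
        data = self.read_from_disk(sector)
        self.sectors[sector] = data
        if len(self.sectors) > self.capacity:
            self.sectors.popitem(last=False)  # evict the least recently used
        return data

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_disk_read(sector):
    # Stand-in for a real disk read (assumption for this sketch).
    return f"data-{sector}"


cache = SectorCache(capacity=2, read_from_disk=fake_disk_read)
# Sector 0 plays the role of the FAT: it is consulted before every data read.
for sector in [0, 5, 0, 6, 0, 7, 0, 8]:
    cache.read(sector)
print(cache.hits, cache.misses, cache.hit_ratio())
```

Even with room for only two sectors, every repeat visit to sector 0 is served from memory, while the one-off data sectors still have to come from the disk. A real workload with more repetition and a bigger cache is where hit ratios like the 90% figure come from.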

Most PC users are familiar with disk caching programs like SMARTDRV. The increased disk usage of programs like Windows made using a disk cache mandatory if you wanted your system to work well. What isn't stressed well enough is that today's processors are so fast that they too would be crippled if it weren't for caching.

Intel has been addressing this problem as their CPU lines progress. Go back a bit to when a 386 running at 16MHz was a speedy system. Typical FPM DRAM was perfectly capable of keeping up with the memory bandwidth demands of this processor. When the 486 was introduced, it included an 8KB cache inside the CPU itself because that chip could easily outrun the memory under some circumstances. That way, instructions that had been executed recently, like when your computer was in a loop, would run at full CPU speed without needing to go back to the slower memory external to the CPU.

Now, when we move on to Pentium class machines, things get a whole lot uglier. If you've got a CPU running at 100MHz, that's an instruction clock cycle every 10ns. Even worse, the Pentium design tries to execute multiple instructions at once, so it chews through instructions even faster than that. Obviously, regular memory is nowhere near fast enough to keep up. The 8KB code cache on the chip itself helps (there's also an 8KB data cache), but as programs have gotten bigger over the last few years, it can't hold as much of them as it used to. Because of this, all good Pentium motherboards include a level 2 cache (the cache inside the CPU itself is the level 1 cache). Typically, the L2 cache is 256KB or 512KB, and runs at 20ns or less. That's still not nearly fast enough to keep the CPU running constantly, but combined with the L1 cache it's an acceptable solution.
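The clock arithmetic above is easy to check. This small sketch converts a clock rate to a cycle time and shows how many CPU cycles go by during one memory access. The function names are made up for this example, and the 70ns DRAM figure is an assumed part rating used only for illustration; the 100MHz and 20ns figures come from the text.

```python
def cycle_time_ns(clock_mhz):
    """Length of one clock cycle in nanoseconds."""
    return 1000.0 / clock_mhz


def cycles_per_access(access_time_ns, clock_mhz):
    """How many CPU clock cycles elapse during one memory access."""
    return access_time_ns / cycle_time_ns(clock_mhz)


print(cycle_time_ns(100))          # one cycle of a 100MHz Pentium
print(cycles_per_access(20, 100))  # a 20ns L2 cache access
print(cycles_per_access(70, 100))  # assumed 70ns DRAM access (hypothetical)
```

At 100MHz, even a 20ns L2 cache costs two full CPU cycles per access, which is why it can't keep the processor running constantly on its own.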

The point here is to give you an idea of the magnitude of the problem. Fast Pentium chips chew through memory very quickly, and if you're throwing around a lot of data, you are going to be at the mercy of the memory subsystem in your computer. It doesn't matter how fast your CPU is if it doesn't have data to work with.

What else uses memory?

Program size isn't the only issue for how much memory you're accessing. Sure, it's a factor, and so is the size of the data you're working with. But you also need to consider the input and output the computer is doing. In a traditional PC design, many of the cards in your system communicate with the rest of the system through a section of memory. For example, a typical VGA card has at least a 128KB buffer that is memory mapped to the card's own memory. This means that when you read or write to a byte in that section of the system memory, the card actually performs that read or write to its own memory. So when you're trying to run that program with the fancy display, all the pixels that are being updated are going through the main system memory as well as to the video card itself. I/O for disk operations works a little bit differently, but the disk cache is all memory based. In both cases, when you need to operate on that part of the system, you need to go through the memory system to get to it. This means that any sluggishness in your motherboard design is not only going to slow the CPU down, it's also going to take your video and disk speed with it.

I stated back at the beginning that memory access speed was the first and most important thing to optimize, and hopefully you see now why that's so. The next thing to get into is how to tell just how fast your memory is working, and what types of designs might be faster.

Measuring memory speed

Typical benchmarks don't give you a good handle on your memory. It used to be that everyone used measurements like the Norton SI or the Landmark to say how fast their computer was. The problem with most of these benchmarks is that the programs they run are so small that they fit inside the processor's internal cache. This means that slow memory won't matter; once the program makes its way into the chip, it's off and running without further access to slow DRAM. Ziff-Davis's CPU Mark and the latest versions of Norton SI use tests that thrash around memory enough to reflect a bit of what the memory speed is like, but that's not their primary purpose.

The easiest program for measuring memory speed is Wintune 95 from Windows magazine. It's small, and you can download a copy from the net or get it on their CD (it shows up on the magazine rack). More information on obtaining a copy is at the end of the benchmarks article. Grab a copy, get Windows 95 or Windows NT running on your system, and you can get a very nice display that shows how the memory system on your computer is working (there's a section below describing how to get similar information for DOS users).

After you run the analyze program, switch to the Chart tab and look at Memory Write Performance. This is the first thing to check, because it's usually the big item that lets you distinguish the good motherboards from the bad. Any motherboard built with one of Intel's modern Triton-style chipsets should get about 84MB/s writing to memory no matter what size block is used. Older motherboards based on the Neptune-era chipsets used to write in the 30-40MB/s range. If you're not getting somewhere near 84MB/s, your system isn't writing memory fast enough, and it's slowing everything down significantly.

Now, switch over to Memory Read Performance. Note how performance drops as the size of the memory block accessed goes up. See where the big jumps are? They should match up with the cache sizing on your system. You should be getting upwards of 500MB/s on the 4K and 8K blocks, because they are sitting in the CPU's internal level 1 cache. The blocks from 16K up to the size of your level 2 cache should get somewhere around 180MB/s. Transfers bigger than the L2 cache need to go back to the actual DRAM itself, and this access typically happens at around 90MB/s. Look at those numbers. The internal CPU cache is well over five times as fast as external memory access. Hopefully, the L2 cache sits between these two in performance. If your motherboard doesn't implement the cache well, you should see that here. Unlike the write performance, the read performance greatly depends on the speed of the CPU itself. The level 1 cache inside the CPU is what's being tested with the smaller blocks, and that speed is totally dependent on how fast the CPU runs. There's also a copy statistic, but it's not all that useful; you can pretty much predict what it's going to be by looking at the read and write speeds, and having them summarized into one figure blurs the things you want to know.
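Spotting the big jumps can also be done mechanically. The sketch below scans a list of (block size, read speed) pairs and reports the block sizes after which the speed falls off sharply. The sample figures are just the round numbers quoted above laid out for a 256KB L2 cache, not measurements from a real machine, and the function name and the 0.75 threshold are assumptions made for this example.

```python
def find_read_speed_drops(samples, drop_ratio=0.75):
    """Return the block sizes (in KB) after which read speed drops sharply.

    samples: list of (block_kb, mb_per_s) pairs sorted by block size.
    A drop is flagged when the next speed is below drop_ratio times the
    current one (the threshold is an arbitrary choice for this sketch).
    """
    drops = []
    for (size, speed), (_, next_speed) in zip(samples, samples[1:]):
        if next_speed < speed * drop_ratio:
            drops.append(size)
    return drops


# Illustrative round numbers from the text, assuming a 256KB L2 cache.
samples = [
    (4, 500), (8, 500),                                  # L1 cache
    (16, 180), (32, 180), (64, 180), (128, 180), (256, 180),  # L2 cache
    (512, 90), (1024, 90),                               # external DRAM
]
print(find_read_speed_drops(samples))
```

The sizes it reports are the last blocks that still fit in each cache level, so on a healthy system they should line up with the L1 data cache size and the L2 cache size. If the second drop never shows up, the L2 cache may be missing or disabled.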

If you switch to the tab for Memory I/O performance, you'll find a summary of the characteristics of your system. What's good and what's bad? There are two ways to tell. You can move back to Wintune's database and compare your machine to others in the same class as your own. The danger with that approach is that not every one of those machines is necessarily a good performer, so you may very well be comparing your system with one that's a dog. Well, since I know what's good and bad, here's a little table I've created to summarize the average characteristics of well-performing motherboards with Intel Triton-class chipsets running Intel Pentium processors:

Clock (MHz)  Chip (MHz)  Read (MB/s)  Write (MB/s)  Copy (MB/s)
50           75          120          61            37
66           100         177          83            54
66           133         235          83            59
60           150         243          75            53
66           166         273          83            60
66           200         310          84            61

One thing to notice here is that I'm including the bus clock speed in addition to the processor speed. The motherboard has a clock it runs at, typically 50, 60, or 66MHz for modern Pentium designs. The CPU itself uses a multiplier that gets it to execute multiple cycles for every stroke of the bus clock. Notice that memory write speed is very much proportional to that external bus speed. If you slow down the bus that clocks the access to external memory, obviously you're not going to be able to write to it as quickly. Because of this, you can see that a Pentium 150 is in most aspects slower than a Pentium 133. Similar reasons make a P120 slower than a P100. You want the fastest bus speed possible, because you take a hit on both writing and reading if it's slower.
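You can check the proportionality claim against the table directly. This sketch takes the six rows from the table above and works out the CPU multiplier and the write speed per MHz of bus clock. The function and variable names are made up for this example; rounding the multiplier to the nearest half step is an assumption that accounts for the nominal "66" bus really running a bit faster than 66MHz.

```python
# Rows from the table: (bus MHz, chip MHz, read, write, copy), speeds in MB/s.
TABLE = [
    (50, 75, 120, 61, 37),
    (66, 100, 177, 83, 54),
    (66, 133, 235, 83, 59),
    (60, 150, 243, 75, 53),
    (66, 166, 273, 83, 60),
    (66, 200, 310, 84, 61),
]


def multiplier(bus_mhz, chip_mhz):
    """CPU clock multiplier, rounded to the nearest 0.5."""
    return round(chip_mhz / bus_mhz * 2) / 2


def write_per_bus_mhz(bus_mhz, write_mb_s):
    """Write throughput divided by bus clock (MB/s per MHz)."""
    return write_mb_s / bus_mhz


for bus, chip, read, write, copy in TABLE:
    print(f"P{chip}: x{multiplier(bus, chip)} multiplier, "
          f"{write_per_bus_mhz(bus, write):.2f} MB/s written per bus MHz")
```

Every row lands between 1.2 and 1.3 MB/s per bus MHz regardless of the chip speed, which is the point: write speed follows the bus clock, not the CPU clock.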

Read speed is almost directly proportional to CPU speed. The slight deviation from linear is because the external memory speed is also factored into this calculation, in the form of the speeds for the blocks larger than the L2 cache. These numbers tend to be fairly consistent even with different motherboards. Because the average speed of the reading is swamped by the 4K and 8K results (which are CPU based), the rest of the memory subsystem doesn't quite impact this as much as it does writing.

Windows? Who needs it!

If you can't or don't want to use Wintune 95, there is a DOS program called cachechk (available for download from the Computer Nerd site at http://www.computernerd.com/programs.htm) that gives you all the same memory access data. Boot your system with a clean setup, run cachechk, and you can get all sorts of statistics about how big and efficient the caches on your system are. cachechk gives its performance statistics in us/KB. To convert this to the same system Wintune uses, MB/s, divide 1000 by the us/KB number (the numbers are summarized in MB/s at the end of the program output). Compared with Wintune, cachechk takes longer to run (because it actually checks out access speed to all of your memory, instead of just a portion), but can be more accurate and detailed. I prefer Wintune because the database features make it so much easier to compare computers and find trends in the graphs instead of sifting through a bunch of benchmark numbers.
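The conversion rule above works out to a one-line function. This is a minimal sketch with a made-up function name; it follows the divide-1000 rule from the text, which treats 1MB as 1000KB.

```python
def us_per_kb_to_mb_per_s(us_per_kb):
    """Convert a cachechk-style us/KB reading to MB/s.

    Follows the rule of dividing 1000 by the us/KB figure,
    which treats 1 MB as 1000 KB.
    """
    if us_per_kb <= 0:
        raise ValueError("us/KB reading must be positive")
    return 1000.0 / us_per_kb


print(f"{us_per_kb_to_mb_per_s(12.0):.1f} MB/s")
```

So a reading of 12 us/KB corresponds to roughly 83MB/s, right in line with the write speeds in the table for a 66MHz bus.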

What are the major chipsets available?

Intel's first major push toward making motherboard chipsets was with the Neptune chipset. While these were very good at the time, the whole design for the CPU to memory interface was not optimal. The Neptune boards were typically installed in Pentium 90 and 100 machines; running Wintune on them puts their Memory Write Performance in the 30-40MB/s range. Intel's next generation chipset was named the Triton, and it's that design that really introduced the better L2 cache systems that implement synchronous and pipeline burst caching. At the same time, Triton supported newer EDO memory and better memory access timing. Much was made of EDO as a performance booster, and it was claimed that this was the reason that Triton chipset motherboards were so much faster than earlier designs. This just ain't so; my own testing shows that the performance advantage of EDO over the older FPM DRAM in a well-implemented Triton-class board is barely measurable. Sure, in systems without cache, EDO soundly whips FPM. It's back to the whole 90/10 idea behind caching; a better L2 cache speeds up the 90% of accesses that hit the cache, while faster RAM like EDO only helps with the remaining 10%. The real reason Triton boards were faster is the improved cache and memory timings, and you can usually get that boost without upgrading your SIMMs. It's not a bad idea to upgrade to EDO (or faster) RAM while you're at it, though, because you will have to throttle back some of the performance features of the newer motherboards in order to use the older RAM with them.
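A rough back-of-the-envelope model shows why faster DRAM matters so little once a cache is in the picture. Everything numeric here is an assumption for illustration: a 90% hit ratio (the 90/10 idea), a 20ns L2 cache, and hypothetical access times of 70ns for FPM and 60ns for EDO. Real memory timing is more involved than a single access time, so treat this as a sketch of the reasoning rather than a prediction.

```python
def average_access_ns(hit_ratio, cache_ns, dram_ns):
    """Weighted average access time for a cache in front of DRAM."""
    return hit_ratio * cache_ns + (1.0 - hit_ratio) * dram_ns


def speedup(old_ns, new_ns):
    """How many times faster the new access time is than the old one."""
    return old_ns / new_ns


FPM_NS, EDO_NS = 70.0, 60.0   # hypothetical DRAM access times
L2_NS, HIT_RATIO = 20.0, 0.9  # assumed cache speed and hit ratio

with_cache = speedup(average_access_ns(HIT_RATIO, L2_NS, FPM_NS),
                     average_access_ns(HIT_RATIO, L2_NS, EDO_NS))
without_cache = speedup(FPM_NS, EDO_NS)

print(f"EDO vs FPM with L2 cache:    {with_cache:.3f}x")
print(f"EDO vs FPM without L2 cache: {without_cache:.3f}x")
```

Under these assumptions the same DRAM upgrade is worth several times more on a cacheless system than on one with a working L2 cache, which matches what the testing described above found.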

Since the first Triton release (which is the 430FX chipset), Intel has produced two more chipsets in the Triton series. During that time, the company decided to switch to numeric coding instead of continuing to use mythological names. The 430HX chipset (sometimes called the Triton 2 by motherboard makers) provides somewhat enhanced performance by streamlining the entire memory subsystem. It's aimed more at the corporate market, and usually is expandable to high amounts of memory (typically 512MB). The latest 430VX chipset (sometimes called the Triton 3) is aimed more at the home market. The enhancements on it are supposed to improve multimedia performance, at the expense of memory capacity (typically these designs only support 128MB) and a bit of performance with EDO memory. Both newer chipsets have support for the Universal Serial Bus (USB), and in some motherboard designs the 430VX lets you use the new SDRAM style memory which improves considerably on older DRAM parts. Performance on both these chipsets is somewhat better than the older Triton chipset, hovering around the noticeable but not especially significant category. An across the board boost in performance of about 5% seems to be typical.

What about other processors and motherboards?

Intel also has the newer Pentium Pro available. Besides some optimizations for 32 bit code, the major reason Pro CPUs run faster is that they have the L2 cache integrated into the chip itself. Intel knows perfectly well that the current implementation of memory systems in normal designs can just barely keep up with a normal Pentium, and the Pentium Pro would have its performance crippled if it weren't for this change (faster L2 cache systems are due to start trickling out soon to address this problem when Pro processors without built-in cache appear). What this closer coupling between CPU and cache results in is incredible memory read and write speed, close to 10x as fast on small blocks. Once you get outside of the L2 cache range, everything drops to the same external DRAM speeds you see with normal designs. This fast memory also makes for extremely good cached disk performance, although all this usually results in is moving the bottleneck away from memory access and back to how fast the drives can move. Pentium Pros are just all around better performers, except for some of that ugly older code. I feel prices are currently reasonable, but they are certainly no bargain.

You know, Intel isn't the only company making motherboard components, although it does seem that way sometimes. Recently in particular, some of Intel's competitors have been releasing systems that are very competitive from a performance standpoint.

Cyrix has released a series of Pentium compatible CPUs that claim to have better performance relative to their clock speed than Intel's chips do. The media has been somewhat confused as to why that is; after all, many of the traditional benchmarks show that the chip performs slower. Nonetheless, on real application benchmarks, Cyrix does very well. If you look at a Cyrix 6x86 system with Wintune, bearing in mind the discussion here, it's obvious why this is. The entire memory write performance is considerably higher than comparable Intel chips. In particular, the caching inside the chip itself works with writes as well as reads, so that the Cyrix chip can write 4K blocks over 3 times as fast as Pentium chips do (this is approximately half the performance of the Pentium Pro in that category). Simply by improving the whole CPU to L2 cache interface, the entire system runs considerably faster, even though everything else about the chip (like reading memory and floating point) is slower than a typical Pentium. Don't be fooled by claims that the only reason the Cyrix systems are faster is that they tend to use a better video card or disk system than the Pentium systems they were compared with. The real reason is the write performance, which as we've already discussed is critical to making video and disks work well.

I can't say I recommend Cyrix chips overall. I'm less than completely convinced that total compatibility is there (read the Cyrix 6x86 Guide for information about problems with NT 4.0). Plus, the trick of using the superior write cache doesn't quite make up in my mind for the inferior integer and floating point speed, especially considering how demanding many of the things I do are in those areas (Quake comes to mind as something you'll find Cyrix owners complaining about, because of the poor floating point).

Another company that has been keeping competitive with Intel is VIA. Their Apollo chipsets are a competitor to the Intel Triton series, and in many performance aspects can be superior. The major supplier of motherboards based on VIA's chips is FIC, in case you want to go looking for more information.

OK, so what should I buy?

The whole reason we've gone through this territory is to become better-informed consumers. Ultimately the question that needs to be answered is what the best performing choice available is, along with some consideration of price sensitivity and what you currently own.

If you run through Wintune, and the results you're getting are significantly lower than the ones in the table I give, you should consider upgrading your motherboard and possibly your CPU. A motherboard based on either the 430HX or 430VX will be an extreme boost in performance for your system. If you've already got a Triton class motherboard, the later chips aren't quite worth upgrading to in my mind. I'd wait for the next generation of motherboard (possibly optimized for things like MMX technology) before spending the cash for what is only a marginal improvement. If you already have a Triton era motherboard, and your results are dismal, make sure you actually have L2 cache, that it's enabled, and that the rest of the settings in your BIOS are configured for good performance. This can become its own adventure.

If you have anything less than a Pentium 100, now is the time to upgrade, along with that new motherboard. Pentium 100s are the price/performance leader. You get the benefits of the fastest normal bus clock speed around (66MHz), and the chips are very cheap at the moment. A P133 is still a good value, with a performance increase almost proportional to the additional cost. P166 and P200 are just too expensive right now to justify for most buyers, when MMX chips are just around the corner. I recommend not going over the 133 unless you're really hurting for a performance boost. The price gouging that you'll see for the faster chips is just not worth it right now, when the performance increase is so small for real applications (the faster chip often just spends more of its time waiting for memory access).

Make sure you've got a modern motherboard, and that whatever CPU you use is using the maximum bus speed possible (even if that means you have to drop the actual CPU speed down a notch). That's the advice I give to those unsatisfied with their current performance. Real performance enthusiasts will need to approach things differently, but that's a topic for another time.


Copyright 1996 Gregory Smith (gsmith@westnet.com).