Bit errors occur about once a week in 4 GB of RAM because of background radiation. Between 2% and 15% of these errors lead to faulty calculations, system crashes or unpredictable behavior. At roughly 52 bit errors a year, that amounts to at least one serious incident in the computer system every year, even at the lowest estimate.
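The arithmetic behind that estimate can be checked with a short Python sketch. It uses only the figures quoted above; the constant and function names are illustrative and not taken from any tool or standard.

```python
# Back-of-the-envelope estimate using only the figures quoted in the text.
ERRORS_PER_WEEK = 1               # about one bit error per week in 4 GB of RAM
WEEKS_PER_YEAR = 52
SERIOUS_FRACTIONS = (0.02, 0.15)  # 2% to 15% of bit errors have visible effects


def serious_incidents_per_year(errors_per_week=ERRORS_PER_WEEK,
                               fractions=SERIOUS_FRACTIONS):
    """Return the (low, high) estimate of serious incidents per year."""
    errors_per_year = errors_per_week * WEEKS_PER_YEAR
    return tuple(errors_per_year * fraction for fraction in fractions)


low, high = serious_incidents_per_year()
print(f"{low:.1f} to {high:.1f} serious incidents per year")
```

With these inputs the lower bound lands just above one incident per year, which is where the "at the lowest estimate" figure comes from.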
The problem has been known for many years, and the solution is Error Correcting Code (ECC) memory. Until now, ECC memory support has been limited to high-end processors. The latest low-power processors from AMD and Intel not only introduce single-chip architectures but also bring ECC memory support to a much wider range of applications. Prices for ECC memory are expected to fall as the technology becomes more widely used and production volumes increase.
A bit error in memory is a spontaneous change from the correct state to an incorrect one. The causes fall into two groups. So-called hard errors are bit errors caused by, for instance, large temperature variations or mechanical stress. Soft errors are caused by background radiation. Soft errors are several orders of magnitude more frequent than hard errors.
Detecting and correcting errors using an 8-bit checksum
With ECC memory, an 8-bit checksum is calculated whenever data is written to RAM. The checksum is stored along with the 64 bits of data from which it was calculated. When the data (and its checksum) is read back from RAM, the processor calculates the checksum once again. These additional calculations reduce performance by 2% to 4% compared with a non-ECC system.
A field study published in 2009 (DRAM Errors in the Wild: A Large-Scale Field Study) was based on thousands of Google servers. Its results included mean correctable error rates of 2000 to 6000 per GB per year.
The checksum calculated by the processor is compared with the checksum read from memory. The result of the comparison is used to detect and correct single-bit errors. The algorithm can also detect double-bit errors within the same 64-bit word, but it cannot correct them.
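The write, read and compare steps can be sketched in Python. The article does not name the code used, so this sketch assumes an extended Hamming code (7 Hamming parity bits plus 1 overall parity bit, giving 8 check bits per 64 data bits). The bit layout and the names encode, decode and _syndrome are illustrative assumptions; real memory controllers do this in hardware and may use a different code and layout.

```python
# Assumption: extended Hamming (72,64) code. Bit 0 of the codeword holds the
# overall parity, bits 1..71 hold Hamming parity bits (at powers of two) and data.
PARITY_POSITIONS = (1, 2, 4, 8, 16, 32, 64)
DATA_POSITIONS = [p for p in range(1, 72) if p not in PARITY_POSITIONS]  # 64 slots


def _syndrome(codeword):
    """XOR of the positions (1..71) of all set bits; 0 for a valid codeword."""
    syndrome = 0
    for pos in range(1, 72):
        if (codeword >> pos) & 1:
            syndrome ^= pos
    return syndrome


def encode(data):
    """Return a 72-bit codeword (as an int) for a 64-bit data word."""
    codeword = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            codeword |= 1 << pos
    # Set the Hamming parity bits so that the syndrome becomes zero.
    syndrome = _syndrome(codeword)
    for p in PARITY_POSITIONS:
        if syndrome & p:
            codeword |= 1 << p
    # Overall parity bit makes the total number of set bits even.
    if bin(codeword).count("1") % 2:
        codeword |= 1
    return codeword


def decode(codeword):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = _syndrome(codeword)
    odd_parity = bin(codeword).count("1") % 2 == 1
    if odd_parity:
        # An odd number of flipped bits: treat as a single-bit error.
        if syndrome > 71:
            return None, "uncorrectable"
        codeword ^= 1 << syndrome  # syndrome 0 means the overall parity bit flipped
        status = "corrected"
    elif syndrome != 0:
        # Even parity but a non-zero syndrome: two bits flipped.
        return None, "uncorrectable"
    else:
        status = "ok"
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (codeword >> pos) & 1:
            data |= 1 << i
    return data, status


word = 0x0123456789ABCDEF
stored = encode(word)
print(decode(stored))                          # clean read
print(decode(stored ^ (1 << 37)))              # one flipped bit is corrected
print(decode(stored ^ (1 << 5) ^ (1 << 60)))   # two flipped bits are only detected
```

In this sketch the syndrome identifies which single bit to flip back, while the overall parity bit distinguishes a correctable single-bit error from a double-bit error that can only be reported.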
Support for ECC memory has mostly been found in high-end processors. ECC has therefore mainly been used in servers and in financial and banking applications, where the tolerance for errors is extremely low.
ECC memory support in processors from Intel and AMD
The low-power Intel Atom Processor E3800 series, the AMD Embedded G-Series SOC and now the latest Intel Atom E3900 series (Apollo Lake) all offer support for ECC memory. ECC support is thus available in processors in the 20-dollar segment, which introduces ECC memory to a large and completely new market segment.
Medical technology and aerospace are two relevant application areas for ECC memory. In both areas, strict requirements for safety and reliability apply. Aerospace faces an additional challenge, since radiation increases at high altitudes. Both application areas are subject to safety standards and regulatory approval. Additional areas where ECC memory may be used for safety reasons are oil and gas, marine, offshore, rolling stock and transportation, where embedded computers are vital for safety.
The price difference between RAM with and without ECC is small for large densities. The reason is that high-density RAM is aimed at server applications, where ECC memory is almost always used today. Production volumes for ECC memory modules are therefore large, and prices are competitive.
Expect bigger price differences for the densities and form factors (<16 GB, SO-DIMM or soldered) commonly used in embedded applications. The additional cost of ECC support in RAM has decreased somewhat recently and can be expected to decrease further, since both AMD and Intel have introduced ECC support in their low-power SOCs for the embedded segment.
Targeting new application areas
Falling prices and increasing availability of ECC memory for the embedded segment will promote its use in applications outside the strictly safety-critical areas. ECC could, for example, prevent miscalculations when forest harvesters log trees while collecting data for billing purposes. Avoiding even a couple of system crashes during the lifetime of a product, along with the time, cost and ill will associated with them, may be motivation enough to introduce ECC in an application.
We would like to increase awareness of memory bit errors, their causes and effects, and the solutions to the problem. We believe that ECC memory will be of interest to new types of applications and additional customers, and that production volumes will increase and prices drop. The driving forces behind this scenario are the new low-power processors supporting ECC from the two big x86 processor manufacturers and the resulting availability of processor platforms in the 20-dollar segment that support the functionality. One memory bit error a week, and possibly a couple of serious incidents every year, may be avoided. That is why it is well worth evaluating the investment in ECC for your application.