Article ID: 828129 - View products that this article applies to.
This article describes the PEARSON function in Microsoft Excel. This article discusses how to use the function. Additionally, this article compares results of PEARSON in Microsoft Office Excel 2003 and in later versions of Excel with the results of PEARSON in earlier versions of Excel.
The PEARSON(array1, array2) function returns the Pearson product-moment correlation coefficient between two arrays of data.
The arguments, array1 and array2, must be either numbers, names, array constants, or references that contain numbers.
The most common use of PEARSON includes two ranges of cells that contain the data, such as PEARSON(A1:A100, B1:B100).
Example of usageTo illustrate the PEARSON function, follow these steps:
In versions of Excel that are earlier than Excel 2003, PEARSON may exhibit round-off errors. The behavior of PEARSON has been improved in Excel 2003 and in later versions of Excel. CORREL has always been implemented with the improved procedure that is now used in Excel 2003 and in later versions of Excel. Therefore, if you are using PEARSON for a version of Excel that is earlier than Excel 2003, Microsoft recommends that you use CORREL instead.
In versions of Excel that are earlier than Excel 2003, you can use the worksheet in this article to run an experiment and discover when round-off errors occur. If you add a constant to each of the observations in B1:B6, the value of PEARSON or CORREL should not be affected. If you increase the value in D2, a larger constant is added to B1:B6. If D2 is less than 7, there are no round-off errors that appear in the first six decimal places of PEARSON. Now change the value of D2 to 7.25, 7.5, 7.75, and then 8. Cells D6:D13 of the worksheet show values of PEARSON and CORREL when D2 = 7.5 and when D2 = 8, respectively.
CORREL is still accurate, but round-off errors in PEARSON have become so severe that division by 0 occurs when D2 = 8.
Earlier versions of Excel exhibit incorrect answers in these cases because the effects of round-off errors are more profound with the computational formula used by these versions. Still, the cases in this experiment can be viewed as extreme.
In Excel 2003 and in later versions of Excel, you should see no changes in values of PEARSON if you try the experiment. However, cells D6:D13 show the same round-off errors that you obtain in earlier versions of Excel.
Results in earlier versions of ExcelIf you name the two data arrays X's and Y's, earlier versions of Excel use a single pass through the data to compute the sum of squares of X's, the sum of squares of Y's, the sum of X's, the sum of Y's, the sum of XY's, and the count of the number of observations in each array. These quantities are then combined in the computational formula in the Help file in earlier versions of Excel.
Results in Excel 2003 and in later versions of ExcelThe procedure that is used in Excel 2003 and in later versions of Excel uses a two-pass process through the data. First, the sums of X's and Y's and the count of the number of observations in each array are computed. From these, the means (averages) of X and Y observations can be computed. Then, on the second pass, the squared difference between each X and the X mean is found; these squared differences are summed. The squared difference between each Y and the Y mean is found; these squared differences are summed. Additionally, the products (X – X mean) * (Y – Y mean) are found for each pair of data points and are summed. These three sums are combined in the formula for PEARSON. None of these three sums are affected by adding a constant to each value in the Y array (or the X array), because that same value is added to the Y mean (or the X mean). In the numeric examples, even with a high power of 10 in cell D12, these three sums are not affected and the results of the second pass are independent of the entry in cell D2. Therefore, the results in Excel 2003 and in later versions of Excel are more stable numerically.
ConclusionsA two-pass approach guarantees better numeric performance of PEARSON in Excel 2003 and in later versions of Excel than the one-pass approach that is used in earlier versions of Excel. The results that you obtain in Excel 2003 and in later versions of Excel will never be less accurate than results that you obtained in earlier versions of Excel.
CORREL has the same functionality and has always been implemented with the approach that is used for PEARSON in Excel 2003 and for later versions of Excel. Therefore, CORREL is a better choice for earlier versions of Excel.
In most practical examples, however, you are not likely to notice a difference between the results in Excel 2003 and in later versions of Excel and the results in earlier versions of Excel. Typical data is unlikely to exhibit the kind of unusual behavior that this experiment illustrates. Numeric instability is most likely to appear in earlier versions of Excel when data contains both a high number of significant digits and relatively little variation between data values.
The procedure that finds the sum of squared deviations about a sample mean by finding the sample mean, by computing each squared deviation, and then by summing the squared deviations is more accurate than the alternative procedure. (The alternative procedure is frequently referred to as the "calculator formula," because it is suitable for use by a calculator on a small number of data points.) The alternative procedure consists of the following steps:
A short list of such functions includes the following functions: