### Linear Regression of Sampling Distributions of the Mean

#### Article Information

**David J Torres ^{*}, Ana Vasilic, Jose Pacheco**

Department of Mathematics and Physical Science, Northern New Mexico College, Española, New Mexico, USA

**^{*}Corresponding author:** David J Torres. Department of Mathematics and Physical Science, Northern New Mexico College, Española, New Mexico, USA

**Received:** 12 December 2024; **Accepted:** 19 December 2024; **Published:** 04 March 2024

**Citation:** David J Torres, Ana Vasilic, Jose Pacheco. Linear Regression of Sampling Distributions of the Mean. Journal of Bioinformatics and Systems Biology 7 (2024): 63-80.

#### Abstract

We show that the simple and multiple linear regression coefficients and the coefficient of determination R^{2} computed from sampling distributions of the mean (with or without replacement) are equal to the regression coefficients and coefficient of determination computed with individual data. Moreover, the standard error of estimate is reduced by the square root of the group size for sampling distributions of the mean. The result has applications when formulating a distance measure between two genes in a hierarchical clustering algorithm. We show that the Pearson R coefficient can measure how differential expression in one gene correlates with differential expression in a second gene.

#### Keywords

Linear regression, Pearson R, Sampling distributions of the mean.



#### 1. Introduction

Linear regression coefficients and the Pearson R correlation have long been used to quantify the relationship between dependent and independent variables [1]. However, the “ecological fallacy” has shown that linear regression and correlation coefficients based on group averages cannot be used to estimate linear regression and correlation coefficients based on individual scores [2, 3].

It may not be well known that if all possible groups are considered, in the case of sampling distributions of the mean, the Pearson R coefficient computed from the group averages is equal to the Pearson R coefficient computed from the original individual scores for one independent variable [4, 5]. We extend this result and show that the linear regression coefficients (for simple and multiple regression) and the coefficient of determination R^{2} computed from sampling distributions of the mean (with or without replacement) are the same as the coefficient of determination and linear regression coefficients computed with the original individual data. The sampling distributions of the mean can also be constructed using differences between two groups of different size. The result has implications for hierarchical clustering of genes. Specifically, the Pearson R coefficient can be used to measure how differential expression in one gene correlates with differential expression in a second gene.

The standard error of estimate is a measure of accuracy for the surface of regression [6]. Using the coefficient of determination, we show that the standard error of estimate is reduced by the square root of the group size for sampling distributions of the mean.

In Section 2, we recall and reformulate the system of equations that is solved to determine the linear regression coefficients for individual scores. In Section 3, we prove the assertion that the same system of equations is solved for sampling distributions of the mean with and without replacement, or for differences between two groups of sampling distributions. In Section 4, we show that the coefficient of determination is the same whether it is calculated using individual scores or all possible group averages from sampling distributions. Section 5 shows that the standard error of estimate is reduced by the square root of the group size for sampling distributions of the mean. Section 6 performs numerical simulations to illustrate these principles. Section 7 applies these results and shows that the Pearson R coefficient can be used to measure how differential expression in one gene correlates with differential expression in a second gene when the z-statistic is used.

#### 2. Computing regression coefficients

Multiple regression requires one to compute the coefficients that minimize the sum of squares

$$\sum_{i=1}^{N}\left(y_i - \beta_0^{*} - \sum_{j=1}^{K}\beta_j^{*}\,x_i^{(j)}\right)^{2} \tag{1}$$

where *x*^{(1)}, *x*^{(2)}, ..., *x*^{(K)} represent *K* different independent variables, *y* is the dependent variable, and the *i*^{th} realizations of the variables *x*^{(j)} and *y* are *x*_{i}^{(j)} and *y*_{i}, respectively, 1 ≤ *i* ≤ *N*. Note that for simple linear regression, *K* = 1. We recast the sum (1) in the form

$$\sum_{i=1}^{N}\left((y_i-\bar{y}) - \beta_0 - \sum_{j=1}^{K}\beta_j\left(x_i^{(j)}-\bar{x}^{(j)}\right)\right)^{2} \tag{2}$$

where

$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i, \qquad \bar{x}^{(j)} = \frac{1}{N}\sum_{i=1}^{N} x_i^{(j)} \tag{3}$$

and the coefficients are related by

$$\beta_0^{*} = \beta_0 + \bar{y} - \sum_{j=1}^{K}\beta_j\,\bar{x}^{(j)} \tag{4}$$

$$\beta_j^{*} = \beta_j, \qquad 1 \le j \le K \tag{5}$$

To solve for *β*_{0}, we set the partial derivative of (2) with respect to *β*_{0} to zero, which yields

$$\sum_{i=1}^{N}\left((y_i-\bar{y}) - \beta_0 - \sum_{j=1}^{K}\beta_j\left(x_i^{(j)}-\bar{x}^{(j)}\right)\right) = 0 \tag{6}$$

However

$$\sum_{i=1}^{N}\left(y_i - \bar{y}\right) = 0 \quad \text{and} \quad \sum_{i=1}^{N}\left(x_i^{(j)} - \bar{x}^{(j)}\right) = 0 \tag{7}$$

which implies that *β*_{0} = 0. Thus one can redefine the problem of computing multiple regression coefficients (1) to be the selection of the coefficients that minimize the sum of squares

$$\sum_{i=1}^{N}\left((y_i-\bar{y}) - \sum_{j=1}^{K}\beta_j\left(x_i^{(j)}-\bar{x}^{(j)}\right)\right)^{2} \tag{8}$$

In the matrix approach to minimizing the sum of squares (8), the system of equations, obtained by setting the partial derivatives of (8) to zero, is written in matrix form [8]

$$\mathbf{y} - \bar{\mathbf{y}} = \mathbf{X}\,\boldsymbol{\beta} \tag{9}$$

where

and **X** is an *N* by *K* matrix whose entries are:

One can solve for the multiple regression coefficients in the vector **β** by left multiplying (9) by the transpose **X**^{T} and solving the linear system

$$\mathbf{X}^{T}\mathbf{X}\,\boldsymbol{\beta} = \mathbf{X}^{T}\left(\mathbf{y}-\bar{\mathbf{y}}\right) \tag{10}$$

The elements in the square *K* by *K* matrix **X**^{T}**X** will be sums of the form

$$\sum_{p=1}^{N}\left(x_p^{(i)} - \bar{x}^{(i)}\right)\left(x_p^{(j)} - \bar{x}^{(j)}\right) \tag{11}$$

Similarly, the entries in the *K* by 1 vector **b** ≡ **X**^{T}(**y** − **ȳ**) will be sums of the form

$$\sum_{p=1}^{N}\left(x_p^{(j)} - \bar{x}^{(j)}\right)\left(y_p - \bar{y}\right) \tag{12}$$

It should be noted that for each pair of fixed indices *i* and *j*, the sum in either expression (11) or (12) can be represented using a sum of the form

$$S_{u,v} = \sum_{p=1}^{N}\left(u_p - \bar{u}\right)\left(v_p - \bar{v}\right) \tag{13}$$
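As an illustrative sketch (our own code, not the authors'; the data and variable names are invented), the centered normal-equations formulation above can be checked against a direct least-squares fit with an intercept column:

```python
# Solve X^T X beta = X^T (y - ybar) with mean-centered predictors and
# compare with numpy's least-squares fit that carries an explicit intercept.
import numpy as np

rng = np.random.default_rng(1)
N, K = 12, 3                              # toy problem sizes (our choice)
x = rng.normal(size=(N, K))               # raw predictors x^(1), ..., x^(K)
y = rng.normal(size=N)

X = x - x.mean(axis=0)                    # entries X_ij = x_i^(j) - xbar^(j)
b = X.T @ (y - y.mean())                  # right-hand side X^T (y - ybar)
beta = np.linalg.solve(X.T @ X, b)        # the K x K system

# Cross-check: the slope entries of an ordinary fit with an intercept
# column agree with beta, and the intercept satisfies the relation (4)
# with beta_0 = 0.
A = np.column_stack([np.ones(N), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
```

The centered system never sees an intercept, yet it reproduces the slopes of the full fit; the intercept is recovered afterwards from the means, which is exactly the role of the relation between the starred and unstarred coefficients.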

In the following section, we show that if the variables *x*^{(1)}, *x*^{(2)}, ..., *x*^{(K)}, and *y* are replaced with all the elements from the sampling distributions of the mean, the system (14) is obtained

$$\alpha\,\mathbf{X}^{T}\mathbf{X}\,\boldsymbol{\beta} = \alpha\,\mathbf{X}^{T}\left(\mathbf{y}-\bar{\mathbf{y}}\right) \tag{14}$$

for some constant *α*. Moreover, we obtain a closed form for the constant *α*. If *m* is the group size and we account for order, the size of the matrix **X** will be *N*^{m} × *K* for selections with replacement and [*N*!/(*N* − *m*)!] × *K* for selections without replacement. However, the resulting system (14) will still be a *K* × *K* system. Since the system (14) is equivalent to the system (10), the regression coefficients for the sampling distributions of the mean will be the same as the regression coefficients computed from the original data according to (5) for 1 ≤ *j* ≤ *K*. The equivalence of *β*_{0}^{∗} follows from *β*_{0} = 0, equation (4), and the fact that the means of the original data (*ȳ* and *x̄*^{(j)}) are the same as the means computed using all the elements from the sampling distributions of the mean (with or without replacement). If we assume that there are *N*_{p} elements in the sampling distribution, this can be stated mathematically as

$$\frac{1}{N_p}\sum_{P}\left(\frac{1}{m}\sum_{i=1}^{m} w_{p_i}\right) = \frac{1}{N}\sum_{p=1}^{N} w_p = \bar{w} \tag{15}$$

or

$$\sum_{P}\left(\frac{1}{m}\sum_{i=1}^{m} w_{p_i} - \bar{w}\right) = 0 \tag{16}$$

where *w* can represent *y* or *x*^{(j)}, and the sum ∑_{P} is a sum over all possible index values in the sampling distribution.
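The mean equality above is easy to verify by brute force. The following sketch (ours, with made-up scores and an arbitrary group size) enumerates every ordered selection:

```python
# Check that the mean of all group averages (the sampling distribution of
# the mean) equals the mean of the original scores, both with and without
# replacement. The scores and group size are arbitrary choices.
from itertools import product, permutations

w = [3.0, 1.0, 4.0, 1.5, 5.0]
N, m = len(w), 2

# All ordered group averages with replacement (N**m of them) ...
with_rep = [sum(w[i] for i in g) / m for g in product(range(N), repeat=m)]
# ... and without replacement (N!/(N-m)! of them).
without_rep = [sum(w[i] for i in g) / m for g in permutations(range(N), m)]

mean_w = sum(w) / N
mean_with = sum(with_rep) / len(with_rep)
mean_without = sum(without_rep) / len(without_rep)
```

All three means agree exactly, because every score appears the same number of times across the full enumeration of selections.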

#### 3. Regression with averages

Let us create elements from the sampling distribution of the mean using elements chosen from the groups

$$U = \{u_1, u_2, \ldots, u_N\}, \qquad V = \{v_1, v_2, \ldots, v_N\} \tag{17}$$

by averaging all possible groups of size *m*_{1} and size *m*_{2} chosen from the sets *U* and *V*. We assume without loss of generality that *m*_{1} ≥ *m*_{2}. The first *m*_{2} choices are paired

while the remaining choices remain unpaired

If the selections are done without replacement, *p*_{r} ≠ *p*_{s} if *r* ≠ *s*. However, if the selections are formed with replacement, *p*_{r} can equal *p*_{s}.

Let us now replace *u*_{p} and *v*_{p} in (13) with all possible averages of *m*_{1} and *m*_{2} elements as shown in (18),

where

for selections with replacement and

for selections without replacement, where the restriction on the summation indices denotes the exclusion of previously chosen indices

The means of the original scores are used in (19) since they are equal to the means of the sampling distributions by (15). Note that order matters in the way the sums are written in (20-21). For example, (*u*_{1}, *u*_{2}, *u*_{3}) is considered a different choice than (*u*_{3}, *u*_{2}, *u*_{1}). Disregarding order would lead to a factor of *m*_{1}! fewer terms in (21). However, the same system (14) would be generated if order were not considered for sampling distributions without replacement.

Factoring out from (19) yields

Sections 3.1 and 3.2 show that *S*_{ū,v̄} will be equal to a factor *α* times *S* as defined by (13) for sampling distributions with and without replacement, respectively. Section 3.3 generalizes these results to differences of two groups of sampling distributions. In all cases, the elements of the matrix **X**^{T}**X** and the vector **X**^{T}(**y** − **ȳ**) will be multiplied by the same factor *α* when elements from the sampling distributions of the mean are used.

**3.1 Sampling distributions with replacement**

Start with *S*_{ū,v̄} as defined by (23). Since we are considering sampling distributions with replacement, the values chosen for the summation indices *p*_{i} do not need to be different. We will show that

If we distribute the sums inside the parentheses in (23), two types of terms are formed. The first type of term takes the form

where the same summation index *p*_{i} is used for *u*_{p_i} and *v*_{p_i}. The second type of term takes the form

where different summation indices, *p*_{i} and *p*_{j}, are used.

All the terms of the form shown in (26) are zero since, when the sums are moved to apply directly to each term, each factor can be summed independently. However, each sum of deviations from the mean vanishes, as noted by equation (7). Thus we must only consider terms of the form (25). The sum (20) acting on a term of the form (25) can be rearranged as

and simplified to

since each sum contributes a factor of *N*. Multiplying the right side of (28) by *m*_{2} to ensure all summation indices *p*_{i}, *i* = 1, ..., *m*_{2} are accounted for, and multiplying by the factor present in (23), yields (24). One can also derive (24) using random variables and expected values.

Since *S*_{ū,v̄} is a multiple of *S* defined by (13), the system of equations (14) will be formed when we set *m* = *m*_{1} = *m*_{2}. Thus the multiple regression coefficients **β** computed from sampling distributions of the mean with replacement will be equal to the multiple regression coefficients computed from the original scores.
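A small brute-force check of this claim (our own sketch, invented data) for simple regression with replacement:

```python
# For simple linear regression, the slope fitted to ALL ordered group
# averages taken WITH replacement equals the slope fitted to the
# individual scores. Data and group size are arbitrary.
from itertools import product

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.7]
N, m = len(x), 2

def slope(xs, ys):
    """Least-squares slope: S_xy / S_xx using deviations from the mean."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

# Build every ordered group of size m with replacement and average it,
# pairing the SAME index choices in x and y.
groups = list(product(range(N), repeat=m))
x_bar = [sum(x[i] for i in g) / m for g in groups]
y_bar = [sum(y[i] for i in g) / m for g in groups]

b1 = slope(x, y)
b1_avg = slope(x_bar, y_bar)
```

The two slopes agree to machine precision, since both the numerator and denominator of the slope pick up the same factor α.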

**3.2 Sampling distributions without replacement**

We will show that

for sampling distributions created without replacement. If we distribute the sums inside the parentheses of (23), we again distinguish between two types of terms: terms of the form (25) and terms of the form (26).

Choose a summation index *p*_{i}. The sum (21) applied to a term of the form (25) can be written as

where the sum with summation index *p*_{i} is placed first and the term

excludes previously chosen index values and the index value chosen for *p*_{i}. The right side of (30) can be simplified to

since the choice made for *p*_{i} in the first sum leaves *N* − 1 choices for the second sum, *N* − 2 choices for the third sum, up to *N* − (*m*_{1} − 1) choices for the last sum.

Moreover, there are *m*_{2} terms similar to (30) for each summation index *p*_{i}. Thus the terms of the form (25) contribute

to the sum *S*_{ū,v̄}.

We now consider terms of the form (26) with two different summation indices *p*_{i} and *p*_{j}. The sum (21) applied to such a term can be written as

where the sums with summation indices *p*_{i} and *p*_{j} are placed first and the term

excludes previously chosen index values and the index values chosen for *p*_{i} and *p*_{j}. Equation (32) can be simplified to

since the choice made for *p*_{i} in the first sum and *p*_{j} in the second sum leaves *N* − 2 choices for the third sum, *N* − 3 choices for the fourth sum, and so on up to the last sum. Moreover, several sums of the form (32) can be identified when the terms in (23) are distributed. Therefore the terms of the form (26) also contribute

to the sum *S*_{ū,v̄}. Combining both contributions yields (29). Thus the multiple regression coefficients **β** computed from sampling distributions of the mean without replacement will be equal to the multiple regression coefficients computed from the original scores.

**3.3 Difference between two groups**

The results in Sections 2.1 and 2.2 generalize to a difference of two groups of sampling distributions. Let *m*_{1} be the size of Group 1 and *m*_{2} be the size of Group 2. The two groups can be composed to allow or exclude common elements.

**3.3.1 Group 1 and Group 2 can share elements**

Consider the expression *S*_{d} shown in (34). The sum ∑_{Q}, composed of *m*_{2} iterated sums, is essentially the same sum shown in either (20) or (21) except that the indexing is done with *q* instead of *p*. We first examine the case where Group 1 and Group 2 can share elements: i.e., *p*_{i} may be equal to *q*_{j}.

The expression for A can be derived by recognizing that *N* choices are available for each of the *m*_{2} sums in ∑_{Q} when the selections are made with replacement. When the selections are made without replacement, there are *N* choices for the first sum, *N* − 1 choices for the second sum, up to *N* − (*m*_{2} − 1) choices for the *m*_{2}'th sum. The same reasoning can be used to derive the expression for B. By (16), the second and third terms in (37) are zero and can be eliminated. Using equations (24) and (29), one can simplify (37) to

Thus the multiple regression coefficients **β** computed from a difference of two groups of sampling distributions of the mean will be equal to the multiple regression coefficients computed from the original scores.
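This equality can also be spot-checked numerically. The sketch below (our own code, invented data) uses the shared-elements case, drawing both groups independently with replacement:

```python
# Each sampling element is (average of a size-m1 group) minus (average of
# a size-m2 group), with both groups drawn independently with replacement
# so they may share elements. The fitted slope matches the individual-score
# slope. Data and group sizes are arbitrary choices.
from itertools import product

x = [1.0, 2.0, 3.0, 4.0]
y = [1.5, 3.6, 5.1, 7.4]
N, m1, m2 = len(x), 2, 1

def slope(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

x_d, y_d = [], []
for g1 in product(range(N), repeat=m1):        # Group 1 selections
    for g2 in product(range(N), repeat=m2):    # Group 2 selections
        x_d.append(sum(x[i] for i in g1) / m1 - sum(x[i] for i in g2) / m2)
        y_d.append(sum(y[i] for i in g1) / m1 - sum(y[i] for i in g2) / m2)

b1 = slope(x, y)
b1_diff = slope(x_d, y_d)
```

Note that the differences are centered at zero (the group means cancel), yet the slope and Pearson R are preserved because the deviation sums in numerator and denominator scale by the same factor.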

**3.3.2 Group 1 and Group 2 do not simultaneously share elements**

We also consider the case where Group 1 and Group 2 do not simultaneously share any elements. We assume the selections are done without replacement. Under these restrictions, one can write (36) as

#### 4. Coefficient of determination

Replacing individual scores with elements of the sampling distribution of the mean, or with differences of two groups of sampling distributions of the mean, leaves the coefficient of determination *R*^{2} unchanged. For one independent variable, *R* will have the same sign for sampling distributions of the mean and individual scores since *R* shares the same sign as the linear regression slope.

#### 5. Standard error of estimate

The standard error of estimate is a measure of accuracy for the surface of regression [7]. In this section, we show that the standard error of estimate is reduced by the square root of the group size for sampling distributions of the mean.

**Figure 1:** Original data (x_{i}, y_{i}), all elements from the sampling distributions of the mean, and the shared linear regression line. The red circles are the original 15 points, the blue squares are the averaged data of size *m* = 2, and the green asterisks are the averaged data of size *m* = 3 without replacement.

is the population variance. Now by (53), *SSE* = (*N* − 2)*s*_{e}^{2}. Replacing *SSE* in (57) with (*N* − 2)*s*_{e}^{2} and solving for *s*_{e} yields

_{e}#### 6. Numerical simulations

Figure 1 plots the original data in red and all elements from the sampling distribution of the mean generated without replacement for groups of size *m* = 2 in blue and *m* = 3 in green for *N* = 10 original points. The original data and elements from the sampling distribution of the mean share the same regression line and coefficient of determination *R*^{2}. The elements of the sampling distribution of the mean are clustered more closely about the regression line compared to the original data, which is consistent with (59) and (61).

**Figure 2:** Original data (*x*_{i}, *y*_{i}, *z*_{i}), *i* = 1, ..., 15 in red and all elements from the sampling distribution of the mean for *m* = 2 (blue) and *m* = 3 (green), and the shared linear regression plane.

Figure 2 plots the original data in red and all elements from the sampling distribution of the mean generated without replacement for groups of size *m* = 2 in blue and *m* = 3 in green for *N* = 15 original points. The original data and elements from the sampling distribution of the mean share the same regression plane *z* = 0.653*x* − 0.712*y* and coefficient of determination *R*^{2} = 0.76. For visualization purposes, the normal distance to the plane is plotted as the z-coordinate and the multiple regression plane is aligned with the *z* = 0 plane.

Figure 3 shows the convergence of sampling distributions of the mean for {(*x*_{i}, *y*_{i}), *i* = 1, 2, ..., *N*}, *N* = 11 scores with Pearson correlation coefficient *R* = 0.35 and linear regression slope *β*_{1} = 0.27. In the first simulation, shown in black and red, elements from the sampling distributions of the mean are created using groups of size *m* = 5 without replacement. In the second simulation, shown in blue and green, elements from the sampling distributions of the mean are created using differences of two groups of size *m*_{1} = 4 and *m*_{2} = 2 without replacement. The horizontal axis plots the fraction of total selections used in the sampling distributions. There are 11^{5} = 161,051 total selections for the first simulation and 11!/5! = 332,640 total selections for the second simulation. The vertical axis plots the base 10 logarithm of the absolute difference. The absolute difference can be between either the Pearson *R* = 0.35 based on individual scores and the Pearson *R* computed from a fraction of the elements from the sampling distributions of the mean (black and blue graphs), or between the linear regression slope *β*_{1} = 0.27 based on the original scores and the slope computed using a fraction of the elements from the sampling distributions of the mean (red and green graphs). While not entirely obvious due to the density of points, all differences decrease from approximately 10^{−6} to less than 10^{−13} in the last 0.001% of the total selections. In addition, the differences do not always decrease monotonically as the fraction of total selections increases, and the differences decrease to very small values (less than 10^{−5}) at certain points during the course of the convergence as noted by the downward spikes.

#### 7. Gene expression and distance between genes

A useful way of organizing the data obtained from microarrays or RNA-seq data is to group together genes that exhibit similar expression patterns through hierarchical clustering. A hierarchical clustering algorithm generates a dendrogram (tree diagram). However, the algorithm requires that a distance be defined to quantify similarities in expression between two individual genes.
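As a hedged sketch of the workflow just described (our own code and library choice, not the authors'; the expression matrix is synthetic), SciPy can build the dendrogram linkage from a correlation-based distance:

```python
# Hierarchical clustering of genes with a correlation-based distance:
# pdist(..., metric='correlation') computes 1 - Pearson R between rows,
# and linkage(...) builds the dendrogram merge structure.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
genes = rng.normal(size=(6, 10))          # 6 genes x 10 patients (toy data)
# Make gene 1 track gene 0 almost perfectly (scaled plus tiny noise).
genes[1] = 2.0 * genes[0] + 0.01 * rng.normal(size=10)

dist = pdist(genes, metric='correlation')  # pairwise 1 - Pearson R
Z = linkage(dist, method='average')        # linkage matrix for a dendrogram
```

Because the correlation distance ignores scale and offset, the strongly correlated pair of genes sits at a distance near zero and merges early, which is the behavior one wants when grouping co-expressed genes.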

**Figure 3:** Convergence of the Pearson *R* and linear regression slope for sampling distributions of the mean.

Let *A*_{i} denote the expression level of gene A for patient *i* and let *B*_{i} denote the expression level of gene B for patient *i*, 1 ≤ *i* ≤ *N*. Distances between genes can be computed using many metrics [9], but two common ones are the Euclidean distance and the Pearson *R* coefficient (64).
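For concreteness, a minimal sketch of the two distances (our own code; the expression values are hypothetical):

```python
# Two common distances between gene expression profiles A and B across
# N patients: Euclidean distance, and a Pearson-R-based distance 1 - R.
import math

A = [2.0, 3.5, 1.0, 4.2, 2.8]   # hypothetical expression levels, gene A
B = [1.8, 3.9, 0.7, 4.0, 3.1]   # hypothetical expression levels, gene B

# Euclidean distance between the two profiles.
d_euclid = math.sqrt(sum((a - b) ** 2 for a, b in zip(A, B)))

# Pearson R, then the distance 1 - R (zero for perfectly correlated genes).
n = len(A)
ma, mb = sum(A) / n, sum(B) / n
num = sum((a - ma) * (b - mb) for a, b in zip(A, B))
den = math.sqrt(sum((a - ma) ** 2 for a in A) * sum((b - mb) ** 2 for b in B))
r = num / den
d_pearson = 1.0 - r
```

The Euclidean distance is sensitive to absolute expression levels, while the Pearson-based distance responds only to the pattern of co-variation, which is the property exploited in the next section.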

The purpose of the next section is to propose a new distance based on the differential expression of two genes. We then show the new measure of distance is the same as the Pearson R coefficient computed from the original scores (64), thus lending support to the use of the Pearson R coefficient in measuring the distance between two genes.

**7.1 Formulating a new distance between two genes**

The new distance will now be defined as *D*_{T} or alternatively *D*′_{T}.

#### 8. Conclusion

We have shown that the linear regression coefficients (simple and multiple) and the coefficient of determination *R ^{2}* computed from sampling distributions of the mean (with or without replacement) are equal to the regression coefficients and coefficient of determination computed with the original data. This result also applies to a difference of two groups of sampling distributions of the mean. Moreover, the standard error of estimate is reduced by the square root of the group size for sampling distributions of the mean.

The result has implications for the construction of hierarchical clustering trees or heat maps which visualize the relationship between many genes. These processes require one to define a distance between two genes using their expression levels. We developed a new measure of distance based on how differential expression in one gene correlates with differential expression in a second gene using the z-statistic. We showed that the new measure is equivalent to the Pearson R coefficient computed from the original scores, thus lending support to the use of the Pearson *R* coefficient for measuring a distance between two genes.

**Funding**

This research is supported by an Institutional Development Award (IDeA) from the National Institute of General Medical Sciences of the National Institutes of Health under grant number P20GM103451. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

**References**

- Yan X. Linear regression analysis: Theory and computing. Singapore: World Scientific (2009).
- Robinson WS. Ecological correlations and the behavior of individuals. Am Sociol Rev 15 (1950): 351-357.
- Goodman LA. Ecological regressions and the behavior of individuals. Am Sociol Rev 18 (1953): 663-664.
- Kenney JF. Mathematics of statistics, Part Two (1947).
- Pearson K. On the probable errors of frequency constants. Biometrika 9 (1913): 1-10.
- Pagano R. Understanding statistics in the behavioral sciences, 10th ed. Belmont, California: Wadsworth Publishing (2012).
- Sullivan M III. Fundamentals of statistics, 3rd ed. New York: Pearson (2010).
- Walpole RE and Myers RH. Probability and statistics for engineers and scientists, 3rd ed. New York: MacMillan Publishing Company (1985).
- D’haeseleer P. How does gene expression clustering work? Nat Biotechnol 23 (2005): 1499-1501.
- Quackenbush J. Computational analysis of microarray data. Nat Rev Genet 2 (2001): 418-427.
- Eisen MB, Spellman PT, Brown PO, et al. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95 (1998): 14863-14868.
- Triola MF. Elementary statistics using the TI-83/84 Plus calculator, 3rd ed. New York: Addison-Wesley (2011).