[Introduction] 无忧考网 presents "CFA Chartered Financial Analyst Exam Practice Questions, Set 3." For more exam topics and practice questions, visit the 无忧考网 CFA exam channel.

三、 Investment Tools: Quantitative Methods

1.A: Sampling and Estimation

a: Define simple random sampling.

Simple random sampling is a method of selecting a sample in such a way that each item or person in the population being studied has the same (non-zero) likelihood of being included in the sample. This is the standard sampling design.

b: Define and interpret sampling error.

Sampling error is the difference between a sample statistic (the mean, variance, or standard deviation of the sample) and its corresponding population parameter (the mean, variance, or standard deviation of the population).

The sampling error of the mean = sample mean - population mean = X bar -μ.

c: Define a sampling distribution.

The sample statistic itself is a random variable, so it also has a probability distribution. The sampling distribution of the sample statistic is a probability distribution made up of all possible sample statistics computed from samples of the same size randomly drawn from the same population, along with their associated probabilities.

d: Distinguish between simple random and stratified random sampling.

Simple random sampling is where the observations are drawn randomly from the population. In a random sample each observation must have the same chance of being drawn from the population. This is the standard sampling design.


Stratified random sampling first divides the population into subgroups, called strata, and then a sample is randomly selected from each stratum. The sample drawn can be either a proportional or a non-proportional sample. A proportional sample requires that the number of items drawn from each stratum be in the same proportion as that found in the population.

e: Distinguish between time-series and cross-sectional data.

A time-series is a sample of observations taken at specific, equally spaced points in time. The monthly returns on Microsoft stock from January 1990 to January 2000 are an example of time-series data.

Cross-sectional data is a sample of observations taken at a single point in time. The sample of reported earnings per share of all Nasdaq companies as of December 31, 2000 is an example of cross-sectional data.

f: State the central limit theorem and describe its importance.

The central limit theorem tells us that for a population with a mean μ and a finite variance σ2, the sampling distribution of the sample means of all possible samples of size n will be approximately normally distributed with a mean equal to μ and a variance equal to σ2/n.

The central limit theorem is extremely useful because the normal distribution is relatively easy to work with when doing hypothesis testing and forming confidence intervals. We can make very specific inferences about the population mean, using the sample mean, no matter the distribution of the population, as long as the sample size is "large."

What you need to know for the exam:

1. If the sample size n is sufficiently large (greater than 30), the sampling distribution of the sample means will be approximately normal.

2. The mean of the population, μ, and the mean of all possible sample means, μx, are equal.

3. The variance of the distribution of sample means is σ2/n.
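These three points are easy to see in a quick simulation. The sketch below (an illustration, not part of the original notes) draws 20,000 samples of size 30 from a skewed exponential population with mean 1 and variance 1, then checks that the sample means have mean close to μ and variance close to σ2/n:

```python
import random
import statistics

random.seed(42)
n = 30          # sample size (the "greater than 30" rule of thumb)
trials = 20000  # number of independent samples drawn

# Skewed (non-normal) population: exponential with mean 1 and variance 1
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(trials)
]

# 2. The mean of the sample means is close to the population mean (1.0)
print(round(statistics.fmean(sample_means), 2))
# 3. The variance of the sample means is close to sigma^2 / n = 1/30 ~ 0.033
print(round(statistics.variance(sample_means), 3))
```

Even though the exponential population is strongly skewed, the distribution of sample means is already close to normal at n = 30.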

g: Calculate and interpret the standard error of the sample mean.

Standard error of the sample mean is the standard deviation of the sampling distribution of the sample means. The standard error of the sample means when the standard deviation of the population is known is calculated by: σx = σ/√n, where: σx = the standard error of the sample means, σ = the standard deviation of the population, and n = the size of the sample.

Example: The mean hourly wage for Iowa farm workers is $13.50 with a standard deviation of $2.90. Let X bar be the mean wage per hour for a random sample of Iowa farm workers. Find the mean and standard error of the sample means for a sample size of 30.

The mean μx of the sampling distribution of X bar is μx = μ = $13.50. Since σ is known, the standard error of the sample means is: σx = σ/√n = 2.90 / √30 = $0.53. In conclusion, if you were to take all possible samples of size 30 from the Iowa farm worker population and prepare a sampling distribution of the sample means, you would get a mean of $13.50 and a standard error of $0.53.
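The arithmetic in this example takes only a few lines of Python (a sketch using the figures above):

```python
import math

mu = 13.50    # population mean hourly wage ($)
sigma = 2.90  # known population standard deviation ($)
n = 30        # sample size

# Standard error of the sample mean: sigma / sqrt(n)
se = sigma / math.sqrt(n)

print(round(se, 2))  # 0.53
```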

h: Distinguish between a point estimate and a confidence interval estimate of a population parameter.

Point estimates are single (sample) values used to estimate population parameters. The formula we use to compute the point estimate is called the estimator. For example, the sample mean X bar is an estimator of the population mean μ, and is computed using the following formula:

X bar = (Σ x / n)

The value we obtain from this calculation for a specific sample is called the point estimate of the mean.

A confidence interval is a range of estimated values within which the actual value of the parameter will lie with a given probability of 1 - α. The term α is called the significance level of the test, and 1 - α is known as the confidence level.

i: Identify and describe the desirable properties of an estimate.

When we have a choice among several estimators, we want to select the one with the most desirable statistical properties: unbiasedness, efficiency, and consistency.

An unbiased estimator is one for which the expected value of the estimator is equal to the parameter you are trying to estimate.

An unbiased estimator is also efficient if the variance of its sampling distribution is smaller than all the other unbiased estimators of the parameter you are trying to estimate. The sample mean, for example, is an efficient estimator of the population mean.

A consistent estimator provides a more accurate estimate of the parameter as the sample size increases. As the sample size increases, the standard deviation (standard error) of the sample mean falls and the sampling distribution bunches more closely around the population mean.

j: Calculate and interpret a confidence interval for a population mean, given a normal distribution with a known population variance.

If the distribution of the population is normal and we know the population variance, we can construct the confidence interval for the population mean as follows:

X bar ± Zα/2 (σ / √n)

Example: Suppose we administer a practice exam to 100 CFA Level I candidates. The 36 candidates in the sample who studied at least 10 hours a week in preparation for the exam had a mean score of 80. Assume the population standard deviation is 15. Construct a 99% confidence interval for the mean score on the practice exam of candidates who study at least 10 hours a week.

80 ± 2.575 (15 / √36) = 80 ± 6.4

The 99% confidence interval has a lower limit of 73.6 and an upper limit of 86.4.
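The interval arithmetic can be verified with a short sketch (values from the example; 2.575 is the z reliability factor for 99% confidence):

```python
import math

x_bar = 80  # sample mean score
sigma = 15  # known population standard deviation
n = 36      # sample size
z = 2.575   # z-value leaving 0.5% in each tail

half_width = z * sigma / math.sqrt(n)  # 2.575 * 2.5 = 6.4375
lower, upper = x_bar - half_width, x_bar + half_width

print(round(lower, 1), round(upper, 1))  # 73.6 86.4
```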

k: Describe the properties of Student's t-distribution.

The Student's t-distribution is similar, but not identical, to the normal distribution in shape. It is defined by a single parameter (the degrees of freedom), whereas the normal distribution is defined by two parameters (the mean and variance).

The Student's t-distribution has the following properties:

· It is symmetrical.

· It is defined by a single parameter, the degrees of freedom (df), where the degrees of freedom equal the number of sample observations minus one (n - 1).

· It is less peaked than a normal distribution, with more probability in the tails.

· As the degrees of freedom (the sample size) gets larger, the shape of the t-distribution approaches a standard normal distribution.

l: Calculate and interpret a confidence interval for a population mean, given a normal distribution with an unknown population variance.

Example:Suppose you take a sample of the past 30 monthly returns for McCreary Inc. The mean return is 2%, and the sample standard deviation is 20%. The standard error of the sample was found to be 3.6%. Construct a 95% confidence interval for the mean monthly return.

Because there are 30 observations, the degrees of freedom are 30 - 1 = 29. Remember, because this is a two-tailed test, we want the total probability in the tails to be α = 5%; because the t-distribution is symmetrical, the probability in each tail will be 2.5% when df = 29. From the t-table, we can determine that the reliability factor for tα/2, or t.025, is 2.045. Then the confidence interval is:

2% ± 2.045 (20 / √30) = 2% ± 7.4%

The 95% confidence interval has a lower limit of -5.4% and an upper limit of 9.4%.
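A sketch of the same computation, using the 3.6% standard error given in the example and the t-table value of 2.045:

```python
x_bar = 2.0  # sample mean monthly return (%)
se = 3.6     # standard error given in the example (%): 20 / sqrt(30), rounded
t = 2.045    # t-table reliability factor, 2.5% in each tail, df = 29

half_width = t * se  # ~7.4
lower, upper = x_bar - half_width, x_bar + half_width

print(round(lower, 1), round(upper, 1))  # -5.4 9.4
```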

m: Discuss the issues surrounding selection of the appropriate sample size.

When the distribution is non-normal, the size of the sample influences whether or not we can construct the appropriate confidence interval for the sample mean. If the distribution is non-normal, but the variance is known, we can still use the Z-statistic as long as the sample size is large (n > 30). We can do this because the central limit theorem assures us that the distribution of the sample mean is approximately normal when the sample is large.

If the distribution is non-normal and the variance is unknown, we can use the t-statistic as long as the sample size is large (n > 30).

This means that if we are sampling from a non-normal distribution (which is sometimes the case in finance), we cannot create a confidence interval if the sample size is less than 30. So, all else equal, make sure you have a sample larger than 30, and the larger, the better.

n: Define and discuss data-snooping/data-mining bias.

Data-snooping bias occurs when the researcher bases his research on the previously reported empirical evidence of others, rather than on the testable predictions of well-developed economic theories.

Data snooping often leads to data mining, when the analyst continually uses the same database to search for patterns or trading rules until he finds one that "works." For example, some researchers argue that the value anomaly (in which value stocks appear to outperform growth stocks) is actually the product of data mining. Because our data set of historical stock returns is quite limited, we don't know for sure whether the differences between value and growth stock returns is a true economic phenomenon, or simply a chance pattern that we stumbled on after repeatedly looking for any pattern in the data.

o: Define and discuss sample selection bias, survivorship bias, look-ahead bias, and time-period bias.

Sample selection bias occurs when some data is systematically excluded from the analysis, usually because of a lack of data availability. The sample then isn't random, and any conclusions we draw from the observed sample can't be applied to the population, because the observed sample and the rest of the population we didn't observe are different.

The most common form of sample selection bias is survivorship bias. A good example of the problems associated with survivorship bias in investments is the study of mutual fund performance. Most mutual fund databases (like Morningstar's) only include funds currently in existence (the "survivors"), and do not include funds that have been closed down or merged.

Look-ahead bias occurs when the analyst uses historical data that was not publicly available at the time being studied. For example, financial information is not usually available until 30 to 60 days after the end of the fiscal year. A study that uses market-to-book value ratios to test trading strategies might estimate the book value as reported at fiscal year end and the market value two months later in order to account for this bias.

Time-period bias can result if the time period over which the data is gathered is either too short or too long. If the time period is too short, the results may reflect phenomena specific to that period, or perhaps data mining. If the period is too long, the fundamental economic relationships that underlie the results may have changed.

1.B: Hypothesis Testing

a: Define a hypothesis and describe the steps of hypothesis testing.

A hypothesis is a statement about the value of a population parameter developed for the purpose of testing a theory or belief. Hypothesis testing is a procedure based on evidence from samples and probability theory used to determine whether a hypothesis is a reasonable statement and should not be rejected, or is an unreasonable statement and should be rejected. The process of hypothesis testing consists of:

Stating the hypothesis.

Selecting the appropriate test statistic.

Specifying the level of significance.

Stating the decision rule regarding the hypothesis.

Collecting the sample and calculating the sample statistics.

Making a decision regarding the hypothesis.

Making the economic or investment decision based on the results of the test.

b: Define and interpret the null hypothesis and alternative hypothesis.

The null hypothesis is the hypothesis that the researcher wants to reject. This is the hypothesis that is actually tested and is the basis for the selection of the test statistics. The null is generally stated as a simple hypothesis, such as:

H0: μ = μ0

Where μ is the population mean and μ0 is the hypothesized value of the population parameter. In our option return example, μ is the return on options (the true return for the population) and μ0 is zero.

c: Discuss the choice of the null and alternative hypotheses.

The null hypothesis is what the researcher wants to disprove. The alternative hypothesis is what is concluded if the null hypothesis is rejected by the test. It is usually the alternative hypothesis that you are really trying to assess. Why? Since you can never really prove anything with statistics, when you discredit the null hypothesis you are implying that the alternative is valid.

d: Distinguish between one-tailed and two-tailed hypothesis tests.

The alternative hypothesis can be one-sided or two-sided (a one-sided test is referred to as a "one-tailed" test and a two-sided test is referred to as a "two-tailed" test). Whether the test is one or two-sided depends on the theory. If a researcher wants to test whether the return on options is greater than zero, she may use a one-tailed test. However, if the research question is whether the return on options is different from zero, the researcher may use a two-tailed test (which allows for deviation on both sides of the null value). In practice, most tests are constructed as two-tailed tests.

e: Define and interpret a test statistic.

Hypothesis testing involves two statistics: the test statistic calculated from the sample data, and the critical value of the test statistic. The comparison of the calculated test statistic to the critical value is a key step in assessing the validity of the hypothesis.

We calculate a test statistic by comparing the estimated parameter (that is, a parameter such as mean return that is calculated from the sample) with the hypothesized value of the parameter (that is, what is specified in the null hypothesis).

Test statistic = (sample statistic - hypothesized value) / (standard error of the sample statistic).

f: Define and interpret a significance level and explain how significance levels are used in hypothesis testing.

Recall the assertion that you can't prove anything with statistics (you can only support or reject an idea or hypothesis). This is because we are using a sample to make inferences about a population. Hopefully, a well-selected sample will yield insight into the true parameters of the population, but we must be careful to note that there is a chance that our sample is somehow unrepresentative of the population.

g: Define and interpret a Type I and a Type II error.

With hypothesis testing, there are basically two possible errors: 1) rejection of a hypothesis when it is actually true (a Type I error) and 2) acceptance of a hypothesis when it is actually false (a Type II error). The significance level is the risk of making a Type I error (rejecting the null when it is true). For instance, a significance level of 5% means that, under the assumed distribution, there is a 5% chance of rejecting a true null hypothesis (this is also called an alpha of .05).

                                  Null Hypothesis is True              Null Hypothesis is False
Fail to Reject Null Hypothesis    Correct decision                     Incorrect decision (Type II error)
Reject Null Hypothesis            Incorrect decision (Type I error)    Correct decision

h: Define the power of a test.

The power of a test is the probability of correctly rejecting the null hypothesis when the null hypothesis is indeed false. The power of a test statistic is important, because we wish to use the test statistic that provides the most powerful test among all the possible tests.

i: Define and interpret a decision rule.

The decision rule for rejecting or failing to reject the null hypothesis is based on the test statistic's distribution. For example, if a z-statistic is used, the decision rule is based on critical values determined from the normal distribution.

The critical values are determined based on the distribution and the significance level chosen by the researcher. The significance level may be 1%, 5%, 10%, or any other level. The most common is 5%.

If we reject the null hypothesis, the result is statistically significant; if we fail to reject the null hypothesis, the result is not statistically significant.

j: Explain the relationship between confidence intervals and tests of significance.

A confidence interval is a range of values within which the researcher believes the true population parameter may lie. The range of values is determined as:

{sample statistic - (critical value)(standard error) < population parameter < sample statistic + (critical value)(standard error)}

The strict interpretation of a confidence interval is that for a level of confidence of, say 95%, we expect 95% of the intervals formed in this manner to have captured the true population parameter.

k: Distinguish between a statistical decision and an economic decision

A statistical decision is based solely on the sample information, the test statistic, and the hypotheses. That is, the decision to reject the null hypothesis is a statistical decision. An economic decision considers the relevance of that decision after transaction costs, taxes, risk, and other factors - things that don't enter into the statistical decision. For example, it is possible for an investment strategy to produce returns that are statistically significant, but which are not economically significant once one considers transactions costs.

l: Discuss the p-value approach to hypothesis testing.

The p-value is the probability, assuming the null hypothesis is true, of getting a test statistic value at least as extreme as the one just calculated (in a two-tailed test, counting both tails). You can make decisions using the p-value instead of the critical value of the test statistic.

If the p-value is less than the significance level of the hypothesis test, H0 is rejected.

If the p-value is greater than the significance level, then H0 is not rejected.
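For a z-based test, a two-tailed p-value can be computed with the standard library alone, since math.erf gives the normal CDF; the z of -3.33 below is illustrative:

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_tailed_p(z):
    """Probability of a result at least as extreme as z, in either tail."""
    return 2.0 * (1.0 - normal_cdf(abs(z)))

alpha = 0.05
p = two_tailed_p(-3.33)  # illustrative calculated z-statistic
print(p < alpha)         # True -> reject H0
```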

m: Identify the test statistic for a hypothesis test about the population mean of a normal distribution with (1) known or (2) unknown variance.

The choice between the t-distribution and the normal (or z) distribution is dependent on sample size and whether the variance of the underlying population is known (in the real world, the underlying variance of the population is rarely known). Tests of a hypothesis with a known population variance can use the z-statistic. In contrast, tests of hypothesis regarding the mean when the population variance is unknown require the use of the t-distributed test statistic.

The test statistic for testing a population mean is the following t-distributed test statistic (often referred to as the t-test):

tn-1 = (sample mean - μ0) / (s / √n)

Where:

μ0 is the hypothesized population mean (i.e., the value in the null)

s is the standard deviation of the sample

n is the sample size

This t-distributed test statistic has n-1 degrees of freedom.

Tests of a hypothesis regarding the mean when the population variance is known require the use of a z-distributed test statistic. The test statistic for testing a population mean is the following z-distributed test statistic (often referred to as the z-test):

z = (sample mean - μ0) / (σ / √n)

The test of a hypothesis regarding the mean when the population variance is unknown but the sample size is large also allows the use of a z-distributed test statistic, with the sample standard deviation in place of σ:

z = (sample mean - μ0) / (s / √n)

Example: An investor believes that options should have an average daily return greater than zero. To empirically assess his belief, he has gathered data on the daily return of a large portfolio of options. The average daily return in his sample is .001, and the sample standard deviation of returns is .0025. The sample size is 250. Explain which type of test statistic should be used and the difference in the likelihood of rejecting the null with each distribution.

The population variance for our sample of returns is unknown (we had to estimate it from the sample). Hence, the t-distribution is appropriate. However, the sample is also large (250), so the z-distribution would also work (it is a trick question - either method is appropriate). What about the difference in likelihood of rejecting the null? Since our sample is so large, the critical values for the t and z are almost identical. Hence, there is almost no difference in the likelihood of rejecting the null.
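A sketch of the computation for this example (either a t or z critical value of roughly 1.96-1.97 applies, as noted above):

```python
import math

x_bar = 0.001  # average daily option return in the sample
s = 0.0025     # sample standard deviation
n = 250        # sample size
mu_0 = 0.0     # null: the mean daily return is zero

t = (x_bar - mu_0) / (s / math.sqrt(n))
print(round(t, 2))  # 6.32 -- far beyond ~1.96, so reject the null
```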

n: Explain the use of the z-test in relation to the central limit theorem.

The variance of the standard normal distribution is 1. The variance of the t-distribution is df / (df - 2), where df is the degrees of freedom.

If the degrees of freedom are 10, the variance is 1.25. If the degrees of freedom are 30, the variance of the t-distribution is 1.07. For large df, the t and z distributions are similar.

The central limit theorem (CLT) states that for any given distribution with a mean of μ and a variance of σ2, the sampling distribution of the mean approaches a normal distribution with a mean of μ and a variance of σ2 / n as the sample size increases.

o: Formulate a null and an alternative hypothesis about a population mean and determine whether the null hypothesis is rejected at a given level of significance.

Example:When your company's gizmo machine is working properly, the mean length of gizmos is 2.5 inches. However, from time to time the machine gets out of alignment and produces gizmos that are either too long or too short. When this happens, production is stopped and the machine is adjusted. To check the machine, the quality control department takes a gizmo sample each day. Today a random sample of 49 gizmos showed a mean length of 2.49 inches. The population standard deviation is 0.021 inches. Using a 5% significance level, should the machine be shut down and adjusted?

Step 1:Let μ be the mean length of all gizmos made by this machine and X bar the corresponding mean for the sample.

Step 2:State the hypothesis.

H0: μ = 2.5 (machine does not need an adjustment)

HA: μ ≠ 2.5 (machine needs an adjustment)

This is a two-tailed test.

Step 3:Select the appropriate test statistic.

z = (X bar - μ0) / (σ / √n)

Step 4:Specify the level of significance. You are willing to make a Type I error 5% of the time so the level of significance is 0.05.

Step 5: State the decision rule regarding the hypothesis. The ≠ sign in the alternative hypothesis indicates that the test is two-tailed with two rejection regions, one in each tail of the normal distribution curve. Because the total area of both rejection regions combined is 0.05 (the significance level), the area of the rejection region in each tail is 0.025. The table value is +/- 1.96. This means that the null hypothesis should not be rejected if the computed z value lies between -1.96 and +1.96, and rejected if it lies outside of these critical values.

Step 6:Collect the sample and calculate the test statistic. The value of X bar from the sample is 2.49. Since population standard deviation is given as 0.021, we calculate the z test statistic using σ as follows:

z = (2.49 - 2.5) / (0.021 / √49) = (-.01 / 0.003) = -3.33

Step 7:Make a decision regarding the hypotheses. The calculated value of z = -3.33 is called the calculated, computed, or observed test statistic. This z-value indicates the location of the observed sample mean relative to the population mean (3.33 size-adjusted standard deviations to the left of the mean). Now, compare the calculated z to the critical z value. The calculated z value of -3.33 is less than the critical value of -1.96, and it falls in the rejection region in the left tail. Reject H0.
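The seven steps of this example reduce to a short computation (values from the gizmo example):

```python
import math

mu_0 = 2.5     # hypothesized mean gizmo length (inches)
x_bar = 2.49   # sample mean
sigma = 0.021  # known population standard deviation
n = 49         # sample size
z_crit = 1.96  # two-tailed critical value at the 5% significance level

z = (x_bar - mu_0) / (sigma / math.sqrt(n))
print(round(z, 2))      # -3.33
print(abs(z) > z_crit)  # True -> reject H0: shut down and adjust the machine
```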

p: Identify the test statistic for a hypothesis test about the equality of two population means of two normally distributed populations based on independent samples.

A test of differences in means requires using a test statistic chosen depending on two factors: 1) whether the population variances are equal, and 2) whether the population variances are known. This test can be used to test the null hypothesis:

H0: μ1 - μ2 = 0 vs. the alternative HA: μ1 - μ2 ≠ 0

The test when population means are normally distributed and the population variances are unknown but assumed equal uses a pooled variance. Use the t-distributed test statistic:

t = {(X bar1 - X bar2) - (μ1 - μ2)} / {(sp2 / n1) + (sp2 / n2)}1/2, where the pooled variance is sp2 = [(n1 - 1)s12 + (n2 - 1)s22] / (n1 + n2 - 2)

The test when the population means are normally distributed and population variances unknown and cannot be assumed to be equal uses the t-distributed test statistic that uses both samples' variances:

t = {(X bar1 - X bar2) - (μ1 - μ2)} / {(s12 / n1) + (s22 / n2)}1/2

q: Formulate a null and an alternative hypothesis about the equality of two population means (normally distributed populations, independent samples) and determine whether the null hypothesis is rejected at a given level of significance.

Sue Smith is investigating whether the announcement period abnormal returns to acquiring firms differ for horizontal and vertical mergers. She estimated the abnormal announcement period returns for a sample of acquiring firms associated with horizontal mergers and a sample of acquiring firms involved in vertical mergers.

Abnormal returns for horizontal mergers: mean = 1%, standard deviation = 1%, sample size = 64.

Abnormal returns for vertical mergers: mean = 2.5%, standard deviation 2.0%, sample size = 81.

Assume that the population means are normally distributed, and that the population variances are equal. Is there a statistically significant difference in the announcement period abnormal returns for these two types of mergers?

Step 1:State the hypothesis.

H0: μ1 - μ2 = 0

HA: μ1 - μ2 ≠ 0

Where μ1 is the mean of the abnormal returns for the horizontal mergers and μ2 is the mean of the abnormal returns for the vertical mergers.

Step 2:Select the appropriate test statistic. Use the test statistic formula that assumes equal variances (from LOS 1.B.p).

Step 3:Specify the level of significance. There are n1 + n2 - 2 degrees of freedom. Therefore, there are 64 + 81 - 2 = 143 degrees of freedom. We will use the common significance level of 5%.

Step 4:State the decision rule regarding the hypothesis. Using the t-distribution table and the closest degrees of freedom, the critical value for a 5% level of significance is 1.980.

Step 5:Collect the sample and calculate the sample statistics. Inputting our data into the formula:

sp2 = [(63)(0.0001) + (80)(0.0004)] / 143 = 0.000268

t = -.015 / 0.00274 = -5.474

Step 6: Make a decision regarding the hypothesis. Because the calculated test statistic falls to the left of the lowest critical value, we reject the null hypothesis. We conclude that the announcement period abnormal returns are different for horizontal and vertical mergers.
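A sketch of the arithmetic, using the pooled variance sp2 = [(n1 - 1)s12 + (n2 - 1)s22] / (n1 + n2 - 2). Carrying full precision gives t of about -5.48; the -5.474 above reflects rounding the standard error to 0.00274.

```python
import math

# Horizontal mergers: mean 1%, s = 1%, n = 64
x1, s1, n1 = 0.01, 0.01, 64
# Vertical mergers: mean 2.5%, s = 2%, n = 81
x2, s2, n2 = 0.025, 0.02, 81

# Pooled variance: sample variances weighted by degrees of freedom
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)

t = (x1 - x2) / math.sqrt(sp2 / n1 + sp2 / n2)
print(round(sp2, 6))  # 0.000268
print(round(t, 2))    # -5.48
```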

r: Identify the test statistic for a hypothesis test about the mean difference for two normal distributions (paired comparisons test).

Frequently, we are interested in the difference of paired observations. If questions about the differences can be expressed as a hypothesis, we can test the validity of the hypothesis. When paired observations are compared, the test becomes a test of paired differences. The hypothesis becomes:

H0: μd = μd0

HA: μd ≠ μd0

Where μd is the population mean of differences and μd0 is the hypothesized difference in means (often zero). The alternative may be one-sided:

HA: μd > μd0

HA: μd < μd0

For paired differences, the test statistic is: t = (d bar - μd0) / sd bar, where sd bar = sd / √n is the standard error of the mean difference.

s: Formulate a null and an alternative hypothesis about the mean difference of two normal populations (paired comparisons test) and determine whether the null hypothesis is rejected at a given level of significance.

Joe Andrews is examining changes in estimated betas for the common stock of companies in the telecommunications industry before and after deregulation. Joe believes that the betas may decline because of deregulation, because companies are no longer subject to the uncertainties of rate regulation, or may increase because there is more certainty regarding competition in the industry.

Mean of differences in betas = 0.23

Sample standard deviation of differences = 0.14

Sample size = 39

Step 1:State the hypothesis. There is reason to believe that the mean differences may be positive or negative, so a two-sided alternative hypothesis is in order here.

H0: μd = 0

HA: μd ≠ 0

Step 2:Select the appropriate test statistic. Use the test statistic formula for paired differences (from LOS 1.B.r).

Step 3:Specify the level of significance. We will use the common significance level of 5%.

Step 4:State the decision rule regarding the hypothesis. There are 39 - 1 = 38 degrees of freedom. Using the t-distribution table, the critical value for a 5% level of significance is 2.024.

Step 5:Collect the sample and calculate the sample statistics. Inputting our data into the formula:

t = (0.23 - 0) / (0.14 /?√39) = 0.23 / 0.022418 = 10.2596

Step 6:Make a decision regarding the hypothesis. We reject the null hypothesis of no difference, concluding that there is a difference in betas from before to after deregulation.
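A sketch of the paired-difference arithmetic (values from the example; 2.024 is the t-table value for df = 38):

```python
import math

d_bar = 0.23    # mean of the beta differences
s_d = 0.14      # sample standard deviation of the differences
n = 39          # number of paired observations
t_crit = 2.024  # two-tailed t-table value at 5%, df = 38

t = (d_bar - 0) / (s_d / math.sqrt(n))
print(round(t, 2))      # 10.26
print(abs(t) > t_crit)  # True -> reject H0
```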

t: Discuss the choice between tests of differences between means and tests of mean differences in relation to the independence of samples.

The test of the differences in means is used when there are two independent samples.

The test of the mean of the difference is used when the samples are not independent, but in fact allow paired comparisons.

u: Identify the test statistic for a hypothesis test about the variance of a normally distributed population.

Given that many financial market observers measure risk as variance, the difference in variance is a common focus of statistical analysis.

A test of the population variance requires the use of a Chi-squared distributed test statistic.

The hypotheses tested are:

H0: σ12 = σ02

HA: σ12≠ σ02

The alternative hypothesis may also be one-sided.

The Chi-squared distribution is asymmetrical and approaches the normal distribution in shape as the degrees of freedom increase.

The Chi-squared test statistic is:

χ2 = (n - 1)s2 / σ02

v: Formulate a null and an alternative hypothesis about the variance of a normally distributed population and determine whether the null hypothesis is rejected at a given level of significance.

Example: In the past, High-Return Equity Fund has advertised that the standard deviation of monthly returns on the fund has been 4%. This was based on estimates from the 1990-1998 period. High-Return wants to verify whether that still adequately describes the standard deviation of the fund's returns. It has collected monthly returns for the period 1998-2000 and determined that the standard deviation of monthly returns is 3.8% over these 24 months. Is the more recent standard deviation different from the advertised standard deviation?

Step 1:State the hypothesis. The null hypothesis is that the variance of monthly returns is (4%)², or 0.0016. Therefore, this is a two-sided test.

H0: σ² = 0.0016

HA: σ² ≠ 0.0016

Step 2:Select the appropriate test statistic. Use the test statistic formula for Chi squared (from LOS 1.B.u).

Step 3:Specify the level of significance. Choosing a 5% level of significance means that there will be a 2.5% probability in each tail of the Chi-square distribution.

Step 4:State the decision rule regarding the hypothesis. There are 23 degrees of freedom. Using the Chi-square values for 23 degrees of freedom and probabilities of 0.975 and 0.025, the critical values are 11.689 and 38.076.

Step 5:Collect the sample and calculate the sample statistics. Inputting our data into the formula:

χ² = (23)(0.001444) / 0.0016 = 0.033212 / 0.0016 = 20.7575

Step 6:Make a decision regarding the hypothesis. We fail to reject the null hypothesis that the variance is (4%)², or 0.0016, because the computed statistic lies between our two critical values.
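The same six-step calculation can be sketched in Python; the inputs and the table critical values are those quoted above, and the variable names are illustrative:

```python
# Chi-square test of a single population variance.
n = 24                  # monthly returns, 1998-2000
s = 0.038               # sample standard deviation
var_0 = 0.04 ** 2       # hypothesized variance, (4%)^2 = 0.0016

chi_sq = (n - 1) * s ** 2 / var_0

# Critical values for 23 df with 2.5% in each tail (chi-square table).
lower, upper = 11.689, 38.076
reject = chi_sq < lower or chi_sq > upper

print(round(chi_sq, 4))   # 20.7575
print(reject)             # False -> fail to reject H0
```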

w: Identify the test statistic for a hypothesis test about the equality of the variances of two normally distributed populations, based on two independent random samples.

The equality of variances of two populations can be tested with an F-distributed test statistic. The hypotheses tested are:

H0: σ1² = σ2²

HA: σ1² ≠ σ2²

One-sided alternative tests are also permissible. It is assumed that the populations from which the samples are drawn are normally distributed. The test statistic is F-distributed:

F = s1² / s2²

With n1 - 1 and n2 - 1 degrees of freedom.

The F-distribution is right-skewed and is truncated at zero on the left-hand side. The shape of the F-distribution is determined by two degrees of freedom (one pertaining to the numerator, one pertaining to the denominator). The rejection region is always in the right side tail of the distribution. Therefore, when constructing this test statistic, always put the larger variance in the numerator.

x: Formulate a null and an alternative hypothesis about the equality of the variances of two normal populations and, given the test statistic, determine whether the null hypothesis is rejected at a given level of significance.

Annie Cower is examining the earnings for two different industries. Cower has noticed that the earnings of the textile industry seem to be more divergent than those of the paper industry. To confirm this, Cower looked at a sample of 31 textile manufacturers and a sample of 41 paper companies. She calculated the standard deviation of earnings across the textile industry as $4.30, and that of the paper industry companies is $3.80. Are the earnings of textile manufacturers more divergent than those of the paper companies?

Step 1:State the hypothesis. The null hypothesis is that the variances of earnings in the two industries are equal. Because Cower wants to know whether textile earnings are more divergent than paper earnings, this is a one-sided test.

H0: σ12 = σ22

HA: σ1² > σ2²

Where σ1² is the variance of earnings of the textile manufacturers and σ2² is the variance of earnings of the paper companies.

Step 2:Select the appropriate test statistic. Use the test statistic formula for F-distributed (from LOS 1.B.w).

Step 3:Specify the level of significance. This means that the calculated F-value's p-value must be less than 5% in order to reject the null hypothesis.

Step 4:State the decision rule regarding the hypothesis. The appropriate critical F-value is taken from the F-distribution table for a 5% level of significance for 30 and 40 degrees of freedom. If the calculated statistic is greater than the critical value of 1.74, we reject the null hypothesis of equal variances.

Step 5:Collect the sample and calculate the sample statistics. Inputting our data into the formula:

F = 4.30² / 3.80² = 18.49 / 14.44 = 1.2805

Step 6:Make a decision regarding the hypothesis. Because the calculated F-statistic of 1.2805 is less than the critical F-statistic of 1.74, we fail to reject the null hypothesis. The variances are not statistically different from one another.
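A Python sketch of this F-test; the sample standard deviations and the critical value are the example's, and, per the rule above, the larger variance goes in the numerator:

```python
s_textile = 4.30    # std dev of textile earnings, n1 = 31
s_paper = 3.80      # std dev of paper earnings, n2 = 41

# Larger sample variance in the numerator keeps the rejection region
# in the right tail of the F-distribution.
f_stat = s_textile ** 2 / s_paper ** 2
critical_f = 1.74   # F-table, 5%, (30, 40) df

print(round(f_stat, 4))      # 1.2805
print(f_stat > critical_f)   # False -> fail to reject H0
```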

y: Distinguish between parametric and nonparametric tests.

Parametric tests rely on assumptions regarding the distribution of the population and are specific to parameters.

Nonparametric tests either do not consider a parameter or have few assumptions about the population that is sampled.

Often nonparametric tests are used along with parametric tests. In this way, the nonparametric test is a backup in case the assumptions underlying the parametric test do not hold.

1.C: Correlation and Regression

a: Define and interpret a scatter plot.

A scatter plot is an illustration of the relationship between two variables. In the scatter plot of two variables, X and Y, each point on the plot is an X - Y pair. A scatter plot allows for visual inspection of the data.

b: Define and calculate the covariance between two random variables.

The covariance between two random variables is a statistical measure of the degree to which the two variables move together. The covariance captures how one variable changes when the other variable changes. A positive covariance indicates that the variables tend to move together; a negative covariance indicates that the variables tend to move in opposite directions. The covariance is calculated as:

Covariance = [the sum from i = 1 to n of (Xi - X bar)(Yi - Y bar)] / (n - 1)

Where n is the sample size, Xi is the ith observation on variable X, X bar is the mean of the variable X observations, Yi is the ith observation on variable Y, and Y bar is the mean of the variable Y observations.

The actual value of the covariance is not meaningful, because it is affected by the scale of the two variables. That is why we calculate the correlation coefficient - to make something interpretable from the covariance information.

c: Define, calculate, and interpret a correlation coefficient.

The correlation coefficient, r (or also denoted as ρ), is a measure of the strength of the relationship between or among variables. Correlation is a unitless measure of the tendency of two variables to move together. The correlation coefficient is bounded by +1 (the variables move together perfectly) and -1 (the variables move exactly opposite of each other).

r = covariance of X and Y / (standard deviation of X)(standard deviation of Y)

Assume there are 10 observations and you are given the data below:

X | Y | X - X bar | (X - X bar)² | Y - Y bar | (Y - Y bar)² | (X - X bar)(Y - Y bar)

Sum: 135 | 416 | 0.00 | 374.50 | 0.00 | 2,342.40 | 445.00

(Only the column sums of the 10 observations are shown.)

X bar = 135 / 10 = 13.5

Y bar = 416 / 10 = 41.6

sX² = 374.5 / 9 = 41.611

sY² = 2,342.4 / 9 = 260.267

r = (445 / 9) / (√41.611 × √260.267) = 49.444 / [(6.451)(16.133)] = 0.475
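The correlation can be reproduced in Python directly from the column sums in the table (a sketch; the variable names are ours):

```python
import math

n = 10
sum_sq_x = 374.5     # sum of (Xi - X bar)^2
sum_sq_y = 2342.4    # sum of (Yi - Y bar)^2
sum_cross = 445.0    # sum of (Xi - X bar)(Yi - Y bar)

cov_xy = sum_cross / (n - 1)           # ≈ 49.444
s_x = math.sqrt(sum_sq_x / (n - 1))    # ≈ 6.451
s_y = math.sqrt(sum_sq_y / (n - 1))    # ≈ 16.133
r = cov_xy / (s_x * s_y)

print(round(r, 3))   # 0.475
```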

r = +1: perfect positive correlation

+1 > r > 0: positive relationship

r = 0: no relationship

0 > r > -1: negative relationship

r = -1: perfect negative correlation

d: Describe how correlation analysis is used to measure the strength of a relationship between variables.

The correlation coefficient is bounded by -1 and +1. The closer the coefficient is to these, the stronger the correlation. However, with the exception of the extremes (that is, r = 1.0, or r = -1), we cannot really talk about the strength of a relationship indicated by the correlation coefficient without a statistical test of significance.

Using our previous example in LOS 1.C.c of r = 0.475 and n = 10, the test statistic is:

t = r√(n - 2) / √(1 - r²) = (0.475 × √8) / √(1 - 0.475²) = 1.3435 / 0.88 = 1.5267

To make a decision, compare the calculated t-statistic with the critical t-statistic for the appropriate degrees of freedom and level of significance. Hence, at a 5% level of significance, the correlation is not significantly different from zero (critical t = 2.3060, two-tailed test - look in the 8df row and match that with the .05, two-tailed column or .025 one-tailed column).

e: Formulate a test of the hypothesis that the population correlation coefficient equals zero and determine whether the hypothesis is rejected at a given level of significance.

Example:Suppose the correlation coefficient is 0.2, and the number of observations is 32. What is the calculated test statistic? Is this correlation significantly different from zero using a 5% level of significance?

The hypotheses are:

H0: ρ = 0

HA: ρ ≠ 0

The calculated t-statistic is:

t = (0.2 × √(32 - 2)) / √(1 - 0.2²) = 0.2√30 / √0.96 = 1.11803

Degrees of freedom = 32 - 2 = 30. Hence, the critical t-value for a 5% level of significance and 30 df is 2.042. Therefore, there is no significant correlation (1.11803 falls between the two critical values of -2.042 and +2.042).
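Both significance tests (r = 0.475 with n = 10 from LOS 1.C.d, and r = 0.2 with n = 32 here) use the same formula, sketched below:

```python
import math

def corr_t_stat(r, n):
    """t-statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(round(corr_t_stat(0.475, 10), 4))  # 1.5267 (vs critical 2.3060)
print(round(corr_t_stat(0.2, 32), 4))    # 1.118  (vs critical 2.042)
```

In both cases the calculated statistic falls inside the critical values, so neither correlation is significantly different from zero at the 5% level.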

f: Define an outlier and explain how outliers can affect correlations.

An outlier is an extreme value of a variable. The outlier may be quite large or small. An outlier may affect the sample statistics, such as a correlation coefficient. Two things to note about outliers:

1. Outliers can cause us to conclude that there is a significant relationship when, in fact, there is none, or to conclude that there is no relationship when, in fact, there is one.

2. The researcher must exercise judgment (and caution) when deciding whether to include or exclude an observation.

g: Explain the nature of a spurious correlation

Spurious correlation is the appearance of a relationship when, in fact, there is no relation. Certain data items may be highly correlated, but not necessarily as a result of a causal relationship. A good example of a spurious correlation is snowfall and stock prices in January. If you regress historical stock prices on snowfall totals in Minnesota, you might well get a statistically significant relationship - especially for the month of January. Since there is no economic reason for this relationship, this would be an example of spurious correlation.

h: Explain the difference between dependent and independent variables in a linear regression.

The variables in a regression consist of dependent and independent variables. The dependent variable is the variable whose variation is being explained by the other variables. Also referred to as the explained variable, the endogenous variable, or the predicted variable.

The independent variable is the variable whose variation is used to explain that of the dependent variable. Also referred to as the explanatory variable, the exogenous variable, or the predicting variable.

i: Distinguish between the slope and the intercept terms in a regression equation.

The parameters in a simple regression equation are the slope (b1) and the intercept(b0):

Yi = b0 + b1 Xi + εi

Where:

Yi = the ith observation on the dependent variable

Xi = the ith observation on the independent variable

b0 = the intercept

b1 = the slope coefficient

εi = the residual for the ith observation

The slope, b1, is the change in Y for a given one-unit change in X. The slope can be positive, negative, or zero.

The intercept,b0, is the line's intersection with the Y-axis at X = 0. The intercept can be positive, negative, or zero.

j: List the assumptions underlying linear regression.

Major assumptions:

1. A linear relationship exists between the dependent and independent variable.

2. The independent variable is uncorrelated with the residuals.

3. The expected value of the disturbance term is zero, that is the mean of the residuals is zero.

4. There is a constant variance of the disturbance term. In other words, the disturbance terms are homoskedastic.

5. The residuals are independently distributed; that is, the residual or disturbance for one observation is not correlated with that of another observation.

6. The disturbance term (residual, or error term) is normally distributed.

k: Define and calculate the standard error of the estimate.

The standard error of the estimate (SEE - also referred to as the standard error of the residual or standard error of the regression, and often indicated as se) is the standard deviation of the predicted dependent variable values about the estimated regression line.

The SEE is easy to calculate. Recall that regression minimizes the sum of the squared vertical distances between the predicted value and the actual value for each observation. The sum of the squared prediction errors is called the sum of squared errors (SSE - not to be confused with SEE). If the relationship between the variables in the regression is very strong, then the prediction errors (and the SSE) will be small. Hence, the standard error of the estimate is a function of the SSE.

Standard error of the estimate (SEE) = √se² = √[SSE / (n - 2)]

l: Define and calculate the coefficient of determination.

The coefficient of determination is another way to measure the relationship between the X and Y variables. The coefficient of determination tells you what proportion of the total variation of the dependent variable (Y) is explained or accounted for by the variation in the independent variable (X). The coefficient of determination is called R² because mathematically it turns out to be just the square of the correlation coefficient (r). Assuming a correlation coefficient of .86, we discover that the R² of the index and stock returns is (.86)² = .74.

An R² of .74 tells us that 74% of the variation in the stock's returns (the bell curve shown on the Y-axis) is explained by the variation of the return in the index (the bell curve shown on the X-axis). R² also describes the systematic relationship between the movement of two variables. In investment finance terms, you would say the movement of the market explains 74% of the stock's total risk (defined as the stock's variability). So, 26% of the stock's volatility is not explained and is unique to the company. Total risk = systematic risk + unsystematic risk.
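The SEE and R² definitions can be tied together in a short Python sketch on a small hypothetical data set (the data are invented purely for illustration):

```python
import math

# Simple OLS fit, then the SEE and R^2 as defined above.
x = [1, 2, 3, 4]
y = [2, 4, 5, 7]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

b1 = s_xy / s_xx              # slope
b0 = y_bar - b1 * x_bar       # intercept

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
see = math.sqrt(sse / (n - 2))        # standard error of the estimate
r = s_xy / math.sqrt(s_xx * s_yy)     # correlation coefficient
r_squared = 1 - sse / s_yy            # equals r**2 in simple regression

print(round(b1, 2), round(b0, 2))          # 1.6 0.5
print(round(see, 4))                       # 0.3162
print(round(r_squared, 4), round(r**2, 4)) # 0.9846 0.9846
```

Note that R² computed as 1 - SSE/(total variation) matches r² exactly, just as the text states for a simple regression.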

m: Calculate a confidence interval for a regression coefficient.

A confidence interval is the range of regression coefficient values for a given value estimate of the coefficient and a given level of probability. The confidence interval for a regression coefficient b1 is calculated as:

b1 ± tc × sb1

Where tc is the critical t-value for the selected confidence level. Although this looks slightly different than what we've seen before, it is precisely the same. All confidence intervals take the predicted value then add and subtract the critical test statistic times the variability of the statistic.

The interpretation of the confidence interval is that this is an interval that we believe will include the true parameter with the specified level of confidence. As the standard error of the estimate rises, the confidence interval widens. In other words, the more variable the data, the less confident you will be when you're using the regression model to estimate the coefficient.

n: Identify the test statistic for a hypothesis test about the population value of a regression coefficient.

Hypothesis testing for a regression coefficient involves placing a band around the estimated coefficient. The true parameter is believed to lie somewhere in the band. A high standard error for the coefficient will cause the confidence interval to be wider.

A frequent question is whether the estimated coefficient is statistically different from zero. If the confidence interval does not include zero, then the coefficient is said to be statistically different from zero (with a specified level of confidence).

To test a hypothesis concerning the slope coefficient (e.g., to see whether the estimated slope is equal to a hypothesized value), we calculate a t-distributed statistic: the estimated slope minus the hypothesized value, divided by the standard error of the slope coefficient.

If the t-statistic is greater than the critical t-value for the appropriate df (or less than the negative of the critical t-value), we can say that the slope coefficient is different from the hypothesized value.

To test whether an independent variable explains the variation in the dependent variable, the hypothesis that is tested is whether the slope is zero:

H0: b1 = 0 versus the alternative (what you conclude if you reject the null), HA: b1 ≠ 0

o: Formulate a null and an alternative hypothesis about a population value of a regression coefficient and determine whether the null hypothesis is rejected at a given level of significance.

Example:Suppose the estimated slope coefficient is 0.78, the sample size is 26, the standard error of the coefficient is 0.32, and the level of significance is 5%. Is the slope different than zero?

The calculated test statistic is: tb = (0.78 - 0) / 0.32 = 2.4375.

The critical t-values are ±2.064 (from the t-table with 24 df). Therefore, we reject the null hypothesis, concluding that the slope is different from zero. Note that if we had formed a confidence interval (i.e., 0.78 ± 0.32 × 2.064), zero would not have been included in the interval. The hypothesis test and the confidence interval will always lead to the same conclusion.
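The example's test and confidence interval, sketched in Python (2.064 is the two-tailed 5% critical t-value for 24 df from a standard t-table):

```python
# Slope t-test: b1_hat = 0.78, s_b1 = 0.32, n = 26, 5% significance.
b1_hat = 0.78
s_b1 = 0.32
t_critical = 2.064        # t-table, 24 df, two-tailed 5%

t_stat = (b1_hat - 0) / s_b1
lower = b1_hat - t_critical * s_b1
upper = b1_hat + t_critical * s_b1

print(round(t_stat, 4))                  # 2.4375
print(round(lower, 4), round(upper, 4))  # 0.1195 1.4405 (zero excluded)
print(t_stat > t_critical)               # True -> reject H0
```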

p: Interpret a regression coefficient.

Interpretation of coefficients:

The estimated intercept is interpreted as the value of the dependent variable (the Y) if the independent variable (the X) takes on a value of zero.

The estimated slope coefficient is interpreted as the change in the dependent variable for a given one-unit change in the independent variable.

Any conclusion regarding the importance of an independent variable in explaining a dependent variable requires determining the statistical significance of the slope coefficient. Simply looking at the magnitude of the slope coefficient does not address the issue of the importance of the variable (i.e., you must perform a hypothesis test or create a confidence interval to assess the importance of the variable).

q: Calculate a predicted value for the dependent variable, given an estimated regression model and a value for the independent variable.

Forecastingusing regression involves making predictions about the dependent variable based on average relationships observed in the estimated regression. Predicted values are values of the dependent variable based on the estimated regression coefficients and a prediction about the values of the independent variables. For a simple regression, the value of Y is predicted as:

Y = b0 + b1Xp

Where Y is the predicted value of the dependent variable and Xp is the predicted value of the independent variable (input).

Example:Suppose you estimate a regression model with the following parameters:

Y = 1.50 + 2.5 X1

In addition, you have forecasted the value of the independent variable to be 20 (i.e., X1 = 20). What is the forecasted value of the Y variable?

Y = 1.50 + 2.50(20) = 1.50 + 50 = 51.5

r: Calculate and interpret a confidence interval for the predicted value of a dependent variable.

Confidence intervals on the predicted value of a dependent variable are calculated in a manner similar to the confidence interval on the coefficient. The hard part about confidence intervals on the dependent variable is calculating the standard error of the forecast (sf). The equation is:

Y ± tc × sf

The standard error of the forecast, sf, is larger than the standard error of the regression, se. It's unlikely that you will have to calculate the standard error of the forecast. However, if you do need to calculate the variance of the forecast, the formula is:

sf² = se² × [1 + (1 / n) + (X - X bar)² / ((n - 1)sx²)]

Where se² is the variance of the regression (i.e., SEE²) and sx² is the variance of the independent variable.

Example:Suppose an analyst generates the following regression results:

Y = 0.01 + 1.2X

SEE = 0.23 (square to get se²), sx = 0.16 (square to get sx²), n = 32, and X bar = 0.06.

Calculate the value of the dependent variable given that the forecast value of X is 0.05. Calculate a confidence interval on the forecasted value. Use a significance level of 5%.

Y = 0.01 + 1.2(0.05) = 0.07

Using a 5% significance level, the critical t-value is 2.042 (t-table, 32-2 = 30 df). The variance of the forecast is:

sf² = 0.0529 × [1 + (1 / 32) + (0.05 - 0.06)² / ((32 - 1)(0.0256))] = 0.05456.

The standard error of the forecast is √0.05456 = 0.23358. Hence, the prediction interval is 0.07 ± 2.042 × 0.23358, or {-0.40697 < Y < 0.54697}.
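The full forecast-interval calculation can be sketched in Python using the example's inputs (variable names are ours):

```python
import math

# Prediction interval for Y = 0.01 + 1.2X, with SEE = 0.23, s_x = 0.16,
# n = 32, X bar = 0.06, forecast X = 0.05, 5% significance.
b0, b1 = 0.01, 1.2
see, s_x = 0.23, 0.16
n, x_bar, x_forecast = 32, 0.06, 0.05
t_critical = 2.042        # t-table, 30 df, two-tailed 5%

y_hat = b0 + b1 * x_forecast

# Variance of the forecast.
s_f_sq = see ** 2 * (1 + 1 / n + (x_forecast - x_bar) ** 2 / ((n - 1) * s_x ** 2))
s_f = math.sqrt(s_f_sq)

lower = y_hat - t_critical * s_f
upper = y_hat + t_critical * s_f

print(round(y_hat, 2))                   # 0.07
print(round(s_f_sq, 5))                  # 0.05456
print(round(lower, 5), round(upper, 5))  # -0.40697 0.54697
```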

s: Describe the use of analysis of variance (ANOVA) in regression analysis.

An ANOVA table is a summary of the explanation of the variation in the dependent variable and is included in the regression output of many statistical software packages. You can think of the ANOVA table as the source of the data for the computation of many of the concepts discussed in this summary. For instance, the data to compute the R² and the standard error of the estimate (SEE) come from the ANOVA table. There are many ways the data from this table can be used in the statistical inference process (most beyond the scope of the CFA curriculum).

t: Define and interpret an F-statistic.

The F-statistic is used in hypothesis testing. Though it can be used in simple regression, it is more often used to test hypotheses involving more than one independent variable. The most common application is for the test of the significance of the entire set of independent variables:

H0: b1 = b2 = b3 = ... = bk = 0

HA: at least one beta different from zero

The F-statistic is used to test whether at least one independent variable in the set of independent variables explains a significant portion of the variation of the dependent variable. This is a goodness of fit test.

The F-statistic is a measure of how well the independent variables, as a group, explain the variation in the dependent variable. It is calculated with the following formula:

F = mean square regression / mean square error = MSR / MSE = (SSR / k) / [SSE / (n - k - 1)]

To determine whether an F-statistic is statistically significant, we compare the calculated F-statistic with the critical F-statistic for k (numerator) and n - k - 1 (denominator) degrees of freedom (n = number of observations, k = number of slope coefficients).

In a simple regression, the F-statistic is equal to the squared t-statistic of the slope coefficient. So, for regression with only one independent variable, the F-statistic is redundant.

The analysis of the F-statistic is similar to the t-statistic, except you use the F-table, and you need to worry about the degrees of freedom in both the numerator and denominator of the previous equation. The numerator df is the number of independent variables. The denominator df is [n - (number of independent variables + 1)], or n - k - 1.
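The claim that F equals the squared t-statistic of the slope in a simple regression can be checked on a small hypothetical data set:

```python
import math

# Fit a simple regression, then compare F = MSR/MSE with t^2 of the slope.
x = [1, 2, 3, 4]
y = [2, 4, 5, 7]
n, k = len(x), 1
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ssr = sum((b0 + b1 * xi - y_bar) ** 2 for xi in x)

f_stat = (ssr / k) / (sse / (n - k - 1))             # MSR / MSE
s_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(s_xx)    # std error of slope
t_stat = b1 / s_b1

print(round(f_stat, 1))       # 128.0
print(round(t_stat ** 2, 1))  # 128.0
```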

u: Discuss the limitations of regression analysis.

Limitations of regression analysis:

Regression relations change over time. This is referred to as non-stationarity.

If the assumptions of regression analysis are not valid, the interpretation and tests of hypotheses are not valid. For example, if the data is heteroskedastic (non-constant variance of the error terms) or exhibits autocorrelation (error terms are not independent), then it is very difficult to use the regression to forecast the dependent variable given information about the independent variables.