Bringing together the AI community in Singapore – companies, startups, researchers, students, professionals – to collaborate and to find research opportunities, business opportunities, and talent.
- Have an interesting story to share?
- Seeking AI talent for your organization?
- Seeking research interns for your lab?
- Seeking an industry partner for your AI projects?
- Are you a researcher seeking an industry partner for a POC or deployment of your IP/research outcomes?
Statistical Thinking in Python (Part 2)
Do you have any questions relating to Statistical Thinking in Python (Part 2)? Leave them here!
Under Bootstrap confidence intervals - Visualizing bootstrap samples.
We plot the ECDFs of 50 sets of bootstrapped samples on top of the ECDF of the original data.
Why does the bootstrapped ECDF look thicker in the middle region compared to the ends? (You can exaggerate this effect by doing 300 instead of 50 iterations of data generation and plotting.)
Can I conclude that np.random.choice is not selecting uniformly from the original data, and is actually selecting from the 700-900 mm rainfall section more frequently?
Or is this conclusion false because
1) the thinner gray ECDFs closer to the edges (rainfall < 600 mm or rainfall > 1000 mm) could actually contain the same number of gray points, which just overlay on top of each other so I can't see them; and
2) the vertical position of each gray point represents its order in the sorted data, and numbers in the middle have "much more room to move around" in their order compared to numbers at the edges of the range?
What can we do with the bootstrapped samples to prove quantitatively that np.random.choice is indeed uniform?
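One quantitative check, sketched below with synthetic stand-in data (the rainfall sample itself isn't reproduced here): tally how often each original observation gets picked across many resamples. Under uniform selection, every observation should be drawn roughly the same number of times, regardless of whether its value sits in the middle or at the edges of the range.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(800, 150, size=100)  # synthetic stand-in for the rainfall sample

n_reps = 10_000
counts = np.zeros(len(data))
for _ in range(n_reps):
    # Resample indices exactly the way bootstrapping resamples values
    idx = rng.choice(len(data), size=len(data))
    counts += np.bincount(idx, minlength=len(data))

# Under uniform selection, each observation is expected n_reps times in total;
# relative deviations stay small for middle and edge values alike
rel_dev = np.abs(counts - n_reps) / n_reps
```

If the per-observation counts come out roughly equal, the selection is uniform, and the visual thickness in the middle is expected anyway: the pointwise spread of an ECDF is largest near the median, which matches your point 2).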
Why do one-tailed rather than two-tailed tests? Is a one-tailed test done because there is actually a minimum or maximum bound (whether natural or artificial/financial/practical) that makes consideration of possible values in the opposite (bounded) tail irrelevant? Or could it be that there is no bound, but the experimenter simply doesn't care about one side? If so, why/how? Are there examples where information in one direction is more valuable than information in the opposite direction? (This is what I infer when I see a one-tailed rather than a two-tailed test.)
In reality, are there more symmetric distributions or asymmetric distributions? Doesn't the symmetry depend on the measurement calibration/units? If I decide nothing can be below 0, the distribution will be cut at 0 and have only a right tail with no left tail, making symmetry impossible. If I allow negative numbers to be considered, the left tail appears. I feel this is important to understand because the existence of tails affects whether one-tailed or two-tailed tests are used.
1) How do you actually form the null hypothesis, and how are the test statistics (mean, variance, difference in means) selected?
2) I know that a low p-value means we reject the null, but how do you actually put it in words? We obtain p = 0.0063 (meaning 0.63% of the simulated values lie to the right on the histogram); does that mean there is a difference between Frog A and Frog B?
I messed around with R before, using p-values to check which factors are significant in a linear regression. Are those p-values used in the same way?
1) This article, written in colloquial language, gives you a thinking framework for what a null hypothesis is and why we make one: https://towardsdatascience.com/hypothesis-testing-decoded-for-movers-and-shakers-bfc2bc34da41
2) Yes, in words that would mean there is a statistically significant difference between the forces produced by Frog A and Frog B. To offer a more meaningful conclusion, you can specify the direction of the difference. Because diff was defined there as data1 - data2 and the function was called with force_a, force_b, the difference is actually "how much stronger is the force from Frog A than Frog B". So a more meaningful conclusion is: "There is statistically significant evidence that the force from Frog A is STRONGER than the force from Frog B."
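To make that setup concrete, here is a minimal sketch of such a permutation test. The force arrays below are made-up stand-ins (the course data and its helper functions aren't reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up impact forces in newtons; assumptions for illustration only
force_a = rng.normal(0.71, 0.2, size=30)
force_b = rng.normal(0.42, 0.2, size=30)

def diff_of_means(d1, d2):
    """Test statistic: how much stronger d1 is than d2, on average."""
    return np.mean(d1) - np.mean(d2)

observed = diff_of_means(force_a, force_b)

# Permutation test: under the null, the A/B labels are interchangeable
combined = np.concatenate([force_a, force_b])
perm_reps = np.empty(10_000)
for i in range(perm_reps.size):
    shuffled = rng.permutation(combined)
    perm_reps[i] = diff_of_means(shuffled[:len(force_a)], shuffled[len(force_a):])

# One-tailed p-value: fraction of shuffles at least as extreme as observed
p = np.sum(perm_reps >= observed) / perm_reps.size
```

Because diff_of_means is called as (force_a, force_b), a small p-value here supports the directional conclusion that Frog A is stronger, not merely that the frogs differ.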
3) A p-value of 0.05 DOES NOT imply a 95% confidence interval.
The p-value comes from NHST, in which there is a certain observed value that you want to test; you simulate the null hypothesis and count the number of "more extreme" simulated test statistics relative to the observed value to get the p-value. In finding the CI, there is no such value you are comparing against.
Because the observed value could be any number, while the simulated distribution of the test statistic (you define this based on what you want to study) stays the same (assuming you use the same test statistic and the same random seed for the bootstrap/permutation/jackknife resampling that creates the simulated distribution), the p-value could be any number.
However, the CI is a property of the resampled/simulated distribution, independent of whatever observed test statistic you are testing. The CI only specifies what range of values of the parameter, resampled/simulated from the sample, would likely contain the true population parameter. The wider the CI you define, the higher the chance it will contain the true population parameter. For symmetric distributions like the sampling distribution of the mean (which is normal), the 95% CI is conveniently given by the mean of the sampling distribution +- 1.96 * standard error (because going up and down 1.96 standard deviations covers 95% of the standard normal distribution).
This relationship is not available when the sampling distribution is not symmetrical (as happens when you define uncommon statistics to sample). For example, the sampling distribution of the variance (shown in the lesson) is not symmetrical but a right-skewed chi-squared distribution. In this case +- does not work, and np.percentile is used instead. Why 2.5, 97.5 to get a 95% range, and not 1.5, 96.5 or 3.5, 98.5, is something I'm still trying to answer. This is what Justin refers to as "hacker statistics" throughout his lessons: simulating by resampling to estimate parameters.
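A minimal sketch of that percentile approach, using a synthetic right-skewed sample rather than the course data. One note on the 2.5/97.5 question: any pair of percentiles 95 points apart spans 95% of the replicates, but 2.5/97.5 is the conventional equal-tailed choice, leaving the remaining 5% split evenly between the two tails.

```python
import numpy as np

rng = np.random.default_rng(7)
sample = rng.chisquare(df=4, size=200)  # a right-skewed sample, like the variance example

# Bootstrap the variance: its sampling distribution is asymmetric,
# so mean +- 1.96 * SE would be misleading
bs_replicates = np.array([
    np.var(rng.choice(sample, size=sample.size)) for _ in range(10_000)
])

# Equal-tailed 95% CI: 2.5% of replicates cut off in each tail.
# (1.5, 96.5) would also span 95%, but it would not be centered.
ci_low, ci_high = np.percentile(bs_replicates, [2.5, 97.5])
```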
In regression, the null would be that the slopes are 0, i.e. the predictor has no impact/effect on the dependent variable; the article from Cassie above explains why the null is usually "no effect".
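For a concrete version of that regression null (and of the R workflow mentioned earlier), scipy.stats.linregress reports exactly this test: its pvalue field is for the null hypothesis that the slope is 0. A sketch with synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + rng.normal(0, 1, size=50)  # data with a real effect: true slope 2

res = stats.linregress(x, y)
# res.pvalue tests the null "slope = 0" (the predictor has no effect),
# the same null that R reports for coefficients in summary(lm(...))
```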
To summarize, NHST goes one step further than the CI in the statistical analysis pipeline of EDA → CI for parameter/test-statistic estimation → hypothesis testing. Both require resampling to generate a simulated distribution: the CI only describes it, while NHST goes further to test an observed statistic against it. This is introduced in part 3 (you would have done parts 1/2) of the statistics series from Justin Bois: https://www.datacamp.com/courses/case-studies-in-statistical-thinking, which will really help organize what you learnt in parts 1/2. After part 3, you can go to https://www.datacamp.com/courses/statistical-simulation-in-python to understand all about modelling real-life problems in the statistical analysis domain. I wish my answer were clearer, but I'm also still learning how to explain CI vs NHST.
Thanks for your long explanation.
I think the next question framed the null hypothesis nicely.
The question framed the hypothesis as:
the true mean of Frog B's impact forces is equal to that of Frog C.
Since the p-value is low, we reject this hypothesis because there is a very rare/extreme chance (0.46% probability) that it will not be the same?
Also, the reason we use
bs_replicates <= np.mean(force_b)
rather than >
is it because the lesser the difference between the values, the closer it conforms to the null hypothesis?
Sorry, I can't edit the previous post.
If we need to check the difference in the distributions, why do we use <= here instead of >?
Replying to Andrew's previous two posts:
we reject this hypothesis because there is a very rare/extreme chance (0.46% probability) that it will not be the same?
It is actually "a very rare/extreme chance that it WILL be the same." Rare chance of same -> high chance of different -> reject null hypothesis of no difference.
the lesser the difference between the values, the closer it conforms to the null hypothesis?
That statement is correct, but the first half of it is not true BECAUSE of the second half.
The sign has nothing to do with the second half of the statement, and nothing to do with which test statistic you chose. The sign is specified to count the "at least as extreme" test-statistic values from the simulated null hypothesis on the correct side of the observed test statistic. A direction needs two points to specify. The 1st point is the null hypothesis; the 2nd point is the observed test statistic. In ex 11, the null hypothesis (1st point) is that force_b's mean is 0.55 N (you can ignore the whole story about 0.55 N being the mean of Frog C); what's important is that the simulated null hypothesis is that force_b has mean 0.55 N. The 2nd point in ex 11 is the observed test statistic np.mean(force_b) = 0.419. You always expand outwards from the test-statistic value of the null hypothesis when thinking about "extremity". Since the observed is < the null, to be at least as extreme as the observed we use <= to see how many values from the null go below the observed.
In ex 12, the null is 0 and the observed is empirical_diff_means = np.mean(force_a) - np.mean(force_b) = 0.288, which is > the null, unlike ex 11. Going outwards from 0, to be more extreme than 0.288 we use the >= sign. Why >= and not >? I have no idea. For ex 12 you could just as well have done force_b - force_a; in that case you would be comparing a negative observed value to a null of 0, so the <= sign is appropriate. The p-value will be exactly the same as what you get from force_a - force_b with the >= sign.
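The ex 11 situation can be sketched with made-up numbers (the course's force_b values aren't reproduced here). One common way to simulate that null is to shift the sample so its mean equals the null value and bootstrap the shifted data; the <= direction then follows from the observed statistic sitting below the null:

```python
import numpy as np

rng = np.random.default_rng(3)
# Made-up stand-in for force_b; values are assumptions for illustration
force_b = rng.normal(0.42, 0.12, size=20)

null_mean = 0.55                 # 1st point: the null hypothesis value
observed = np.mean(force_b)      # 2nd point: the observed test statistic

# Simulate the null by shifting the sample so its mean equals the null value
shifted = force_b - observed + null_mean
bs_replicates = np.array([
    np.mean(rng.choice(shifted, size=shifted.size)) for _ in range(10_000)
])

# Observed < null, so "at least as extreme" means <= the observed value
p = np.sum(bs_replicates <= observed) / bs_replicates.size
```

Flipping the sign to >= here would count the unremarkable middle of the null distribution instead of its extreme tail, which is why the direction matters.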
Thanks for the explanation.
1. We need to know the null.
2. Get the baseline (observed test statistic).
3. If the baseline is more than the null, then we calculate with >=, and vice versa.
Also, the reason it's >= rather than > is explained here: https://en.wikipedia.org/wiki/P-value#Definition_and_interpretation.
For Step 3 there is actually a third scenario, two-tailed, as per the link above.
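That two-tailed third scenario can be sketched as counting the replicates at least as far from the null value as the observed, in either direction. The numbers below are illustrative assumptions (a standard normal null distribution, not course data):

```python
import numpy as np

rng = np.random.default_rng(5)
null_value = 0.0
# Made-up simulated null distribution of some test statistic
replicates = rng.normal(null_value, 1.0, size=100_000)
observed = 1.5

# Two-tailed: count replicates at least as far from the null as the observed,
# on both sides, by comparing absolute distances from the null value
p_two = np.sum(np.abs(replicates - null_value)
               >= abs(observed - null_value)) / replicates.size
```

For a symmetric null distribution this is roughly double the one-tailed p-value, which is why the choice of one vs two tails (raised earlier in this thread) changes the conclusion at a fixed threshold.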