Bringing together the AI community in Singapore – companies, startups, researchers, students, professionals – to collaborate, find research and business opportunities and talent.
Statistical Thinking in Python (Part 1)
Do you have any questions relating to Statistical Thinking in Python (Part 1)? Leave them here!
In the "Distribution of no-hitters and cycles" exercise, I noticed that the peak of the random variable generated by t1 + t2 (both drawn from np.random.exponential) occurs where the distributions of the individual t1 and t2 intersect. Why is this so? tau1 and tau2 were chosen to be different (1000 and 500) to demonstrate the effect.
Image here: https://img.frl/4hods
Also, if tau1 = tau2, the peak occurs at a total waiting time of tau1 (or tau2). Why is this so?
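Both observations can be checked empirically. The following is a sketch (not course code) that simulates the sum with the thread's tau values and compares the histogram peak against the analytic mode of the sum of two exponentials:

```python
import numpy as np

rng = np.random.default_rng(0)
tau1, tau2 = 1000, 500
n = 1_000_000

# simulate the total waiting time t1 + t2
t = rng.exponential(tau1, n) + rng.exponential(tau2, n)

# locate the histogram peak
counts, edges = np.histogram(t, bins=50, range=(0, 5000))
peak = 0.5 * (edges[np.argmax(counts)] + edges[np.argmax(counts) + 1])

# analytic mode of the sum: setting the derivative of the sum's PDF to zero
# gives (1/tau1)e^{-t/tau1} = (1/tau2)e^{-t/tau2}, which is exactly the
# condition that the two individual exponential PDFs are equal --
# hence the peak sits at the intersection point
t_star = tau1 * tau2 * np.log(tau1 / tau2) / (tau1 - tau2)
print(peak, t_star)  # both near 1000 * ln 2, roughly 693
```

For tau1 = tau2 = tau the sum is Gamma-distributed with shape 2 and scale tau, whose mode is (2 - 1) * tau = tau, matching the second observation.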
This lesson also teaches us that we can simulate any story to get its probability density function.
If I wanted to go beyond observing natural processes and instead design a density function with an end shape in mind (say, for an artificial game world), could I compose different basic component distributions, i.e. combine the mathematical expressions of PDFs whose parameter values are known, to achieve that shape? The lesson makes it seem like we won't know what the combined shape looks like until we simulate it.
I also don't fully understand the purpose of simulation. What do we do with the result afterwards? If it is to calculate probabilities, don't the mathematical expressions of the PDFs/CDFs provide that directly, and give an even more precise answer than in the lesson, where the instructor extends lines from the CDF and reads rough values off the axes?
Assuming a distribution designer wants a peak at a particular x-value using the t1 + t2 model from the lesson, does simulation let the designer tweak the values of tau1 and tau2 until the peak moves to the desired x-value? (In other words, it acts as a trial-and-error tool.)
Can I say that if this designer knows how to manipulate the math, they have no need for simulation at all when designing the function? For example, consider the distribution of the sum of values after rolling two dice: we don't need to simulate to know that it is symmetrical, triangular in shape, and ranges from 2 to 12.
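The two-dice case can indeed be done either way. Here is a minimal sketch that runs the simulation and also writes down the exact triangular PMF directly, with no simulation needed:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# simulation: roll two dice n times and sum the faces
total = rng.integers(1, 7, size=n) + rng.integers(1, 7, size=n)
vals, counts = np.unique(total, return_counts=True)

# exact PMF, derivable by hand: P(sum = s) = (6 - |s - 7|) / 36
exact = {s: (6 - abs(s - 7)) / 36 for s in range(2, 13)}

print(vals)        # 2..12
print(exact[7])    # 1/6, the peak of the triangle
```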
In chapter 4, "Thinking probabilistically -- Continuous variables", in the "Introduction to the Normal distribution" video, the tutor mentioned:
"To draw samples using np.random.normal, we need to provide parameters, the mean and std, to parameterise the (theoretical) normal distribution we are sampling from; the mean and std computed from the data are good estimates."
How did he conclude that np.mean(michelson_speed_of_light) and np.std(michelson_speed_of_light) are good estimates to use as inputs to np.random.normal?
Was it just from seeing that the michelson_speed_of_light histogram looks normal when compared to the normal PDF in the "Comparing data to a Normal PDF" slide?
Or is it because there is otherwise no other source of information from which to calculate a mean/std for np.random.normal?
So what would be a 'bad estimate' here?
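For context, using np.mean and np.std of the data is the standard plug-in estimate of the normal parameters: it picks the normal distribution whose center and spread match the sample's own. A minimal sketch (the michelson_speed_of_light values below are synthetic stand-ins, not the course data):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in for the course's michelson_speed_of_light array
michelson_speed_of_light = rng.normal(299_852, 79, size=100)

# plug-in estimates of the theoretical distribution's parameters
mu = np.mean(michelson_speed_of_light)
sigma = np.std(michelson_speed_of_light)

# draw from the parameterised theoretical normal distribution
samples = rng.normal(mu, sigma, size=100_000)

# by construction the samples reproduce the data's center and spread
print(np.mean(samples), mu)
print(np.std(samples), sigma)
```

A "bad estimate" would be anything that ignores the data (e.g. guessing mu = 0), or applying this recipe to data whose histogram is clearly bimodal or heavily skewed, where no single normal can match the shape.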
Is R-squared completely useless?
I always thought that scale does not matter in linear regression, but then I saw that the scale, which affects the slope, also affects R-squared, making me doubt its usefulness even more.
Reading the graphs left to right and top to bottom, they all have the same RSS and RMSE (simply sqrt(RSS / number of points)) but increasing slope, which increases VAR (the variance of the data) and thus increases R-squared (1 - RSS/VAR).
Does this mean we can conclude nothing from looking at R-squared (and possibly adjusted R-squared too? I'm not sure whether this reasoning applies there), and that RMSE (an absolute measure in the units of y, unlike the relative measure R-squared) should always be the first-choice evaluation metric?
Here's my attempt to experiment:
```python
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import rand
from scipy import stats

def rss(y_data, y_model):
    return np.sum(np.square(y_data - y_model))

np.random.seed(1)
x = np.arange(1, 11, 1)        # x = 1..10
y = 1.05 * x + rand(10) * 0.5  # y = slope*x + some random noise; EDIT SLOPE OF X HERE TO EXPERIMENT!

fig, ax = plt.subplots()
_ = ax.plot(x, y, marker='o', markersize=5)
ax.xaxis.set_major_locator(plt.MultipleLocator(1))

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
y_model = intercept + x * slope
_ = ax.plot(x, y_model, color='r')

print('correlation: ', np.corrcoef(x, y)[0, 1])
print("slope: %f\nintercept: %f" % (slope, intercept))
print("r-squared: %f" % r_value**2)
print("rss: ", rss(y, y_model))                 # compare y (not x) with y_model
print('rmse: ', np.sqrt(rss(y, y_model) / 10))
```
How do I produce the same RSS while changing the slope, as in the four graphs? My RSS is decreasing as I increase the slope. Does it have something to do with using stats.linregress rather than another OLS library? (Does this exercise actually require a fixed RSS/RMSE to make its point?)
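One way to reproduce the figure's behaviour (constant RSS, increasing R-squared) is to fix the noise vector once and reuse it for every slope; note also that RSS should compare y, not x, with y_model, which is likely why the value was drifting. A sketch of this idea:

```python
import numpy as np
from scipy import stats

def rss(y_data, y_model):
    return np.sum(np.square(y_data - y_model))

rng = np.random.default_rng(1)
x = np.arange(1, 11)
noise = rng.random(10) * 0.5   # fixed once, reused for every slope

results = []
for m in (0.5, 1.0, 2.0, 4.0):
    y = m * x + noise
    slope, intercept, r, p, se = stats.linregress(x, y)
    y_model = intercept + slope * x
    # since m*x lies exactly in the span of the predictors, the OLS
    # residuals depend only on the noise, not on m: RSS stays constant
    # while VAR(y) (and hence R^2 = 1 - RSS/VAR) grows with m
    results.append((rss(y, y_model), r**2))
    print(f"slope={m}: RSS={results[-1][0]:.6f}  R^2={results[-1][1]:.4f}")
```

So the four graphs do not need a special OLS library: any fit reproduces them once the noise is held fixed.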
Can you please help check this question?
What is the probability that the winner of a given Belmont Stakes will run it as fast or faster than Secretariat?
Hence, in this case shouldn't it be > 144 instead?
Hi Andrew, the data represent times in seconds. The question asks about "as fast or faster", which translates to <= when dealing with times, because a smaller number of seconds is faster.
If you insist on calculating with > 144, that is fine too; it just answers the complementary question "how many horses are slower than (rather than as fast as or faster than) Secretariat?". Subtracting that result from 1 gives the same answer to this question, because all the probabilities sum to 1.
Why calculate <= 144 if 1 minus the > 144 count gives the same answer? Because <= is the more direct calculation and maps better onto the investigation question. In hypothesis testing (introduced in the Part 2 course), the null hypothesis is usually of the form "there is no change / nothing is happening / the number is not especially big, small, or far from the mean or usual conditions". We are usually interested in how extreme an observed value is (it could sit on either side of the mean) compared to the value implied by the null hypothesis (usually the mean), and we measure this by counting how often values generated under the null hypothesis are even more extreme than what was observed. If more extreme values occur readily, the observed value is nothing special; if they do not, the observed value is worth investigating further.
How do we know which side "more extreme" refers to? Just compare the observed value (144) against the mean (mu = 149.22, calculated on the previous screen). Since 144 is less than 149.22, "more extreme" for this study means <=; it would mean >= if the observed value were greater than the mean. The count of these "more extreme" values then provides a p-value, which is compared against a predetermined significance level (benchmarked within the industry and specified before running the experiment) to decide whether to reject the null hypothesis. You can further consider whether the lessons are doing a 1-tailed or 2-tailed NHST.
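The counting described above looks roughly like this in code. This is a hedged sketch: mu = 149.22 comes from the thread, but sigma here is a made-up placeholder; in the exercise you would compute both with np.mean and np.std on the Belmont data:

```python
import numpy as np

rng = np.random.default_rng(42)
mu = 149.22      # mean winning time from the previous screen
sigma = 1.6      # hypothetical std; use np.std of the Belmont data in the exercise

# simulate many hypothetical Belmont winning times
samples = rng.normal(mu, sigma, size=1_000_000)

# "as fast or faster" in seconds means a time <= 144
p = np.sum(samples <= 144) / len(samples)
print(p)
```

The complement, 1 - np.sum(samples > 144) / len(samples), gives the same number, as noted above.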
If you ask how anyone can be faster when Secretariat is already the fastest horse at 144 seconds: when thinking probabilistically, nothing is impossible; events or numbers may simply occur less frequently, and that is represented by a tiny (possibly invisible) height on the data's PDF. This is also why, if you plot the kernel density estimation version of a histogram with pd.Series(data).plot(kind='kde'), the lines extend beyond the minimum and maximum values, so you may want to chop them off (plt.xlim) when visualizing.
Short answer: the default number of bins is 10 if no number or binning strategy is chosen.
Long answer: Interesting that you've asked. 🙂 Here's a good explanation: https://stackoverflow.com/questions/33458566/how-to-choose-bins-in-matplotlib-histogram/33459231
Have fun reading!
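You can verify the default quickly: both np.histogram and plt.hist use 10 bins when no bins argument is given (plt.hist reads the default from rcParams['hist.bins']):

```python
import numpy as np

data = np.random.default_rng(0).normal(size=1000)

counts, edges = np.histogram(data)  # no bins argument supplied
print(len(counts))   # 10 bins by default
print(len(edges))    # 11 bin edges
```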