Also, can someone explain what's going on under the hood in Pipeline? The way pipe methods are applied to each object inside the pipe seems inconsistent.

From the lesson:

pipeline = make_pipeline(scaler, kmeans)
pipeline.fit(samples)
pipeline.predict(samples)

I was expecting to see a transform or a fit_transform from the scaler after fit, but it's nowhere in the code above?

Similarly, in this answer https://stackoverflow.com/questions/51459406/apply-standardscaler-in-pipeline-in-scikit-learn-sklearn,

how did the person replying come up with this implied workflow of the pipe? Is it documented anywhere? I am trying to understand whether there is a rule of thumb for which methods are applied to which class in the pipe, in what order, and whether there are repetitions/restarts from an earlier class in the pipe when moving from training to testing data.

Step 0: The data are split into TRAINING data and TEST data according to the cv parameter that you specified in the GridSearchCV.

Step 1: the scaler is fitted on the TRAINING data

Step 2: the scaler transforms TRAINING data

Step 3: the models are fitted/trained using the transformed TRAINING data

Step 4: the scaler is used to transform the TEST data

Step 5: the trained models predict using the transformed TEST data
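The steps above can be sketched as plain scikit-learn calls. This is a rough sketch, not the lesson's actual code: the data, estimator settings, and train/test split here are made-up stand-ins. The key rule is that Pipeline.fit calls fit_transform on every step except the last, and plain fit on the last; Pipeline.predict calls transform on every step except the last, and predict on the last.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_train = rng.random((40, 2))  # stand-in TRAINING data
X_test = rng.random((10, 2))   # stand-in TEST data

scaler = StandardScaler()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)

# What pipeline.fit(X_train) does under the hood:
X_train_scaled = scaler.fit_transform(X_train)  # Steps 1-2: fit + transform the scaler
kmeans.fit(X_train_scaled)                      # Step 3: fit the final estimator

# What pipeline.predict(X_test) does under the hood:
X_test_scaled = scaler.transform(X_test)        # Step 4: transform only (no refit!)
labels = kmeans.predict(X_test_scaled)          # Step 5: predict with the trained model
```

This is why no explicit transform appears in the lesson's code: the pipeline calls it for you, and it never refits the scaler on test data.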

In reality, are there more symmetric distributions or asymmetric ones? Doesn't the symmetry depend on the measurement calibration/units? If I decide nothing can be below 0, the distribution is cut at 0 and has only a right tail with no left tail, making symmetry impossible. If I allow negative values, the left tail appears. I feel this is important to understand because the existence of tails affects whether one-tailed or two-tailed tests are used.
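The truncation effect described above is easy to simulate. In this sketch (made-up data, not from the lesson), a symmetric normal sample loses its left tail once values below 0 are discarded, and the sample skewness turns positive:

```python
import numpy as np

rng = np.random.default_rng(0)
# Symmetric normal sample centred at 1, so part of its left tail is below 0
x = rng.normal(loc=1.0, scale=1.0, size=100_000)
truncated = x[x >= 0]  # "nothing can be below 0": cut off the left tail


def skew(a):
    """Sample skewness: 0 for a symmetric sample, > 0 for a right-leaning one."""
    a = a - a.mean()
    return np.mean(a**3) / np.std(a) ** 3


print(skew(x))          # close to 0: symmetric
print(skew(truncated))  # clearly positive: only a right tail remains
```

So the asymmetry here is an artifact of the measurement constraint, not of the underlying process.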

]]>"To draw samples using np.random.normal, we need to provide parameters, the mean and std, to parameterise the (theoretical) normal distribution we are sampling on, the Mean and Std computed from the data are good estimates"

How did he conclude that np.mean(michelson_speed_of_light) and np.std(michelson_speed_of_light) are good estimates to use as input to np.random.normal?

Just from seeing that the michelson_speed_of_light histogram looks normal compared to the normal pdf in the slide "Comparing data to a Normal PDF"?

Or because there is no other source of information from which to calculate a mean/std for use in np.random.normal?

So what would be a 'bad estimate' here?

Thanks. I would like to ask what the next course is for me to attend to learn AI or data analysis/science.

Best Regards

CK

Do you mean this? https://www.datacamp.com/courses/intro-to-python-for-data-science

This is a FREE 4-hour course by Datacamp - you can start ANYTIME.

Cheers!


I am interested in the "Intro to Python for Data Science" course.

Please advise when the next intake is.

Best Regards

Lau


The exercise tried is_sky_clear.resample('D').sum() / is_sky_clear.resample('D').count() to get the fraction of each day having clear sky, but it seems oblivious to the fact that each hour does not necessarily have exactly one (or even any) row representing it, leaving the fraction of the day being sunny biased by the uneven number of data points within each hour.

(…continued) How do we correctly answer the question of what fraction of each day has clear sky, then? By eliminating days that do not have exactly 24 rows in is_sky_clear.resample('D').count()? How important is it to design a fixed time interval when collecting time-series data?
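One way to remove the bias is to first collapse the data to at most one value per hour, then average those hourly values per day; days with missing hours can be dropped with a count check. This is only a sketch with made-up data, assuming is_sky_clear is a boolean Series on a DatetimeIndex as in the exercise:

```python
import numpy as np
import pandas as pd

# Made-up stand-in: two days of hourly boolean observations
idx = pd.date_range("2020-01-01", periods=48, freq="h")
is_sky_clear = pd.Series(np.random.default_rng(0).random(48) > 0.5, index=idx)

# Collapse to one value per hour first, so hours with duplicated rows
# no longer outweigh the others:
hourly = is_sky_clear.resample("h").mean()

# Fraction of observed hours that were clear, per day:
daily_fraction = hourly.resample("D").mean()

# Keep only days where all 24 hours were actually observed:
hours_per_day = hourly.resample("D").count()
daily_fraction = daily_fraction[hours_per_day == 24]
```

The two-stage resample fixes the duplicated-hours bias; the count filter handles missing hours, at the cost of discarding incomplete days. Designing a fixed sampling interval up front avoids both problems.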