Do you have any questions relating to pandas Foundations? Leave them here!
From 4. Case study – sunlight in Austin - Daily hours of clear sky: is the exercise's approach incorrect? The data for this exercise has too many rows on some days and too few on others. The task was to draw a boxplot of the fraction of each day that had clear sky. df_clean has 10337 rows (representing 1 year of hourly data, as introduced by the tutor; note that this is more than 365*24 = 8760, so that is the first alarm). After is_sky_clear.resample('D').count(), many days have more than 24 entries (on these days the sea_level_pressure column contains 'M', representing missing values) and one day has only 18 entries.
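For anyone who wants to reproduce this check, here is a minimal sketch on toy data. The timestamps and values are made up for illustration; in the exercise you would run the last two statements on the real is_sky_clear series. The names rows_per_day and irregular_days are my own.

```python
import pandas as pd

# Hypothetical stand-in for the exercise's is_sky_clear series:
# day 1 has 24 hourly reports plus 2 extra reports, day 2 has only 18 reports.
day1 = pd.Timestamp("2011-01-01 00:53") + pd.to_timedelta(list(range(24)), unit="h")
extra = pd.Timestamp("2011-01-01 05:00") + pd.to_timedelta([10, 30], unit="min")
day2 = pd.Timestamp("2011-01-02 00:53") + pd.to_timedelta(list(range(18)), unit="h")

is_sky_clear = pd.concat([
    pd.Series(True, index=day1),
    pd.Series(True, index=extra),
    pd.Series(True, index=day2),
]).sort_index()

# Number of rows recorded on each calendar day.
rows_per_day = is_sky_clear.resample("D").count()

# Days that do not have exactly one row per hour of the day.
irregular_days = rows_per_day[rows_per_day != 24]
print(irregular_days)
```

On the real data, irregular_days would list every day whose row count differs from 24, which is how the days with extra or missing reports show up.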
The exercise computes is_sky_clear.resample('D').sum()/is_sky_clear.resample('D').count() to get the fraction of each day that had clear sky, but it seems oblivious to the fact that an hour is not necessarily represented by exactly one row (or by any row at all). The resulting daily fraction is therefore biased by the uneven number of data points within each hour.
(…continued) How do we correctly answer the question of what fraction of each day has clear sky, then? By dropping the days that do not have exactly 24 rows in is_sky_clear.resample('D').count()? And how important is it to use a fixed time interval when collecting time series data?
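One possible approach, offered as a sketch and not as the course's intended solution: first collapse the data to one value per hour, then average those hourly values per day, so that every hour carries equal weight no matter how many reports it has. Days with too few observed hours can be masked instead of plotted. The function daily_clear_fraction, its min_hours parameter, and the toy data are my own assumptions.

```python
import pandas as pd

def daily_clear_fraction(is_sky_clear, min_hours=24):
    """Hour-balanced fraction of each day with clear sky (hypothetical helper)."""
    # Stage 1: one value per hour = share of that hour's reports that were clear.
    # Hours without any report become NaN.
    hourly = is_sky_clear.astype(float).resample("h").mean()
    # Stage 2: average the hourly values per day (NaN hours are skipped).
    daily = hourly.resample("D").mean()
    # Mask days where fewer than min_hours hours were actually observed.
    hours_observed = hourly.resample("D").count()
    return daily.where(hours_observed >= min_hours)

# Toy data: day 1 is clear for hours 0-11 only, with 6 extra clear reports
# packed into hour 5; day 2 has just 18 hourly reports.
day1 = pd.Timestamp("2011-01-01 00:53") + pd.to_timedelta(list(range(24)), unit="h")
extra = pd.Timestamp("2011-01-01 05:00") + pd.to_timedelta([5, 10, 15, 20, 25, 30], unit="min")
day2 = pd.Timestamp("2011-01-02 00:53") + pd.to_timedelta(list(range(18)), unit="h")

is_sky_clear = pd.concat([
    pd.Series([h < 12 for h in range(24)], index=day1),
    pd.Series(True, index=extra),
    pd.Series(True, index=day2),
]).sort_index()

# Row-weighted fraction, equivalent to resample('D').sum() / resample('D').count().
naive = is_sky_clear.astype(float).resample("D").mean()
balanced = daily_clear_fraction(is_sky_clear)

print(naive)
print(balanced)
```

In this toy example the extra reports in hour 5 pull the row-weighted figure for day 1 above one half, while the hour-balanced figure stays at one half; day 2 is masked because only 18 hours were observed. Lowering min_hours would keep partially observed days, at the cost of fractions that describe only the observed hours.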