AMA #8 Questions & Answers.

Your most burning questions from AMA #8.

Pivoting and melting seem to do the same things as stacking/unstacking. Is the only difference that one deals with the index but the other does not?

No. Pivoting, stacking and melting are all tools you can use to reshape your dataframe (i.e. you want to move a header to an index, a header to a value field, or an index back to a header). With practice and experience, you’ll be able to quickly select the right tool for the job you wish to do.
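As a rough illustration of those moves, here is a minimal pandas sketch. The table, its column names (city, year, sales) and the values are made up for this example and were not part of the session.

```python
import pandas as pd

# Hypothetical wide table: one row per city, one column per year.
wide = pd.DataFrame({"city": ["A", "B"], "2019": [1, 3], "2020": [2, 4]})

# melt: the year headers become values in an ordinary "year" column.
long = wide.melt(id_vars="city", var_name="year", value_name="sales")

# pivot: the values in "year" become column headers again.
back = long.pivot(index="city", columns="year", values="sales")

# stack: the column headers move into the innermost level of the index.
stacked = wide.set_index("city").stack()

# unstack: the innermost index level moves back out to column headers.
unstacked = stacked.unstack()

print(long)
print(stacked)
```

In this sketch, melt and pivot take ordinary column names as arguments, while stack and unstack operate on whatever is currently in the index, which is why set_index is called first.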

Is it super important to use both the table and its alias in a join? Otherwise, can the query error out or use a lot of resources? What kind of error and resource usage?

It is a very good practice to get into, particularly to make clear distinctions between tables and columns that have the same name.

Remember that SQL is a declarative language, so if you do not make clear distinctions, the database engine cannot tell which table or column you mean. It may reject the query with an ambiguous-column error, or it may run a join that matches far more rows than you intended, which is where the large resource usage comes from.
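To make this concrete, here is a small sketch using Python's built-in sqlite3 module. The customers and orders tables are hypothetical, and the exact error text varies between database engines. The final query shows one plausible source of heavy resource usage: a join with no condition returns every combination of rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Unqualified column: both tables have an "id", so the engine cannot choose.
try:
    conn.execute("SELECT id, name FROM customers JOIN orders ON customer_id = id")
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)
print(error_message)

# Every column is qualified with its table alias, so nothing is ambiguous.
rows = conn.execute("""
    SELECT c.id, c.name, o.amount
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.id
""").fetchall()
print(rows)

# No join condition at all: every customer is paired with every order.
cross_count = conn.execute(
    "SELECT COUNT(*) FROM customers AS c, orders AS o"
).fetchone()[0]
print(cross_count)
```

With two customers and three orders the unconditioned join is harmless, but the row count grows as the product of the two table sizes.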

In the data modelling context, ‘NORMALISATION’ pertains to the process of tidying up the respective data tables to ensure qualities such as no duplication of the same info in different tables, each record being uniquely identifiable by its primary key, etc., before they are used. I see the term normalisation also being used for pre-processing during the Data Science sessions. Is this the same concept, or just a loose similarity, as it seems to refer to ‘taming the spread of the data sample values’?

Normalization has different meanings depending on what stage of the data pipeline you are in. SQL normalization is a different concept from normalizing (standardizing) data in a statistical sense, and also different from normalizing (feature scaling) data for deep learning.
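The two numeric senses can be sketched in a few lines of plain Python. The sample values are invented, and min-max scaling is used here as just one common form of feature scaling. SQL normalization is about table design and is not something you compute on values like this.

```python
from statistics import mean, pstdev

# Hypothetical sample values.
values = [10.0, 20.0, 30.0, 40.0, 50.0]

# Standardizing (z-score): subtract the mean, divide by the standard deviation.
mu = mean(values)
sigma = pstdev(values)
standardized = [(v - mu) / sigma for v in values]

# Min-max feature scaling: squeeze the values into the range [0, 1].
lo, hi = min(values), max(values)
min_max_scaled = [(v - lo) / (hi - lo) for v in values]

print(standardized)
print(min_max_scaled)
```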

Conventionally, one would ensure the integrity of the information downloaded from source first before proceeding to spend time on the heavy post-processing efforts. This may entail the use of CONTROL TOTALS (e.g. by category groups for a rough check) to compare with what was downloaded from source. The following questions arise: (1) Are these CONTROL TOTALS normally available from source? (2) Exercises, especially for the DELETE statement in SQL (the point applies to all operations that alter datasets), recommend the use of a subsequent ROWCOUNT to ensure that the intended number of rows has been deleted. But this presumes that we know the number beforehand, which is not normally the case, especially with large data files. So is there a command to save our DataFrames (as DataFrames; most methods shown convert to other formats) before the DELETE, so that we can recover from mistakes using interim file saves?

1) Yes, your db admin should be able to provide this to you. If you are using a public dataset, it should be included in a well-documented data dictionary. If you are collecting your own data, then it will be one more job to be done during the data collection process.

2) Yes. You can always save a dataframe using df.to_pickle(file_name) and load it back later with pd.read_pickle(file_name).
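A minimal sketch of that workflow, assuming a small made-up dataframe and a hypothetical file name (df_backup.pkl): compute control totals by category, save a pickle before deleting rows, count how many rows were removed, and reload the backup if needed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data standing in for a downloaded dataset.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "category": ["a", "b", "a", "c"],
    "amount": [10.0, 20.0, 30.0, 40.0],
})

# Control totals by category, to compare against figures from the source.
control_totals = df.groupby("category")["amount"].sum()

with tempfile.TemporaryDirectory() as tmp:
    backup_path = os.path.join(tmp, "df_backup.pkl")

    # Save the dataframe as-is before altering it.
    df.to_pickle(backup_path)

    # Delete rows, then count how many were actually removed.
    cleaned = df[df["category"] != "a"]
    rows_deleted = len(df) - len(cleaned)

    # If the delete was a mistake, reload the saved dataframe.
    restored = pd.read_pickle(backup_path)

print(control_totals)
print(rows_deleted)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.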

One of the requirements AISG sets for the AI Apprenticeship Programme is that a participant has to be a degree holder. What is the basis for this requirement, and will AISG ever consider non-degree holders for this programme if they possess the necessary skills and abilities?

Right now, the AI Apprenticeship Programme is positioned as a post-grad programme. At the moment, we’re not looking at lowering the entry requirement. 

Are applicants to the apprenticeship programme required to be on GitHub with open-source projects to qualify?

You don’t have to be on GitHub with open-source projects, but it’s definitely a plus, especially given the quality of our candidate pool.

Any suggestion on the direction of participants beyond this course, other than the apprenticeship programme?

A lot of self-study will be in order should you decide to pursue this track without going for a Master’s or a PhD. Do check out this FAQ post for a better basic understanding.

This question is more course admin related. Will the DataCamp courses we have completed still be accessible to us after the end of the one-year premium usage period? Will the code we wrote still be accessible?

Unfortunately not. However, you can copy the code you completed over to your personal repo (such as GitHub or Google Colab), or sign up with DataCamp on your own. 

Any recommendation on hackathons that we can participate in?

Check out Shopee’s National Data Science Challenge and HackOMania by GeeksHacking. 

Links from Session

Automate the Boring Stuff
The Quiet Listener
Artificial Racer 

Watch AMA #8 here for a quick recap. 
