Your most burning questions from AMA #5.
(1) Are CONTROL TOTALS normally available from source?
(2) Exercises esp for the DELETE() function in SQL (point applies to all functions that alter datasets) recommend the use of a subsequent ROWCOUNT to ensure that the intended number of ROWs have been deleted. But this presumes that we know the number beforehand and is not normally the case esp with large data files. So is there a command to save our DataFrames (as DataFrames. Most methods showed converts to other formats) before the DELETE so that we can retrieve from mistakes made with interim file saves made?
1. It depends. Hash/Control totals are not always available.
2. Yes, you can save your dataframe to other formats such as pickles or feather formats. However, I do not recommend saving the dataframe in CSV format as CSV may change the format of the data.
In the data modelling context , ‘NORMALISATION’ pertains to the process of tidying up the respective Data Tables to ensure qualities such as no duplication of same info in different Tables, each record being uniquely identifiable by its primary key etc before they are used.I see the term Normalisation also being used for pre-processing during the Data Science sessions. Is this in the same contextual purposes or just a loose similarity as it seem to refer to ‘taming the data sample values spread’?
If you are referring to Normalization of Databases, it means slightly different things than normalizing the data for data science works.
Normalization is a technique of organizing the data in the database. It is a systematic approach of decomposing tables to eliminate data redundancy(repetition) and undesirable characteristics like Insertion, Update and Deletion Anomalies. It is a multi-step process that puts data into tabular form, removing duplicated data from the relation tables.
Check out this link for more information.
How do table alias names affect performance?
If you do not name your tables, future table changes may break a query, especially if you are adding columns to your table, separating, or making joins.
What are some resources to practise SQL?
Understand the very basic syntax:
select (if) count(distinct()) case () when then () else () end sd ()
from () join() union() a horizontal version of join()
If you get stuck:
Try to write a query on smaller pieces of data
Ask yourself: How can I solve this problem in a single query?
Can I filter, aggregate or join some tables?
Can I try using excel or some graphic interface first then reimplement in SQL?
For some tasks, the same data analysis/modelling task can be done within an SQL query or within Python. How do people decide how to allocate jobs between SQL and Python?
It depends on where the data are stored; generally speaking, if it is stored in a relational database/on a server, you need SQL.
I am trying to compare selecting all the data into a dataframe first and using pandas from then on, versus doing filtering/grouping/column generation/joins in SQL and only creating dataframe at last step. Is it always faster/better in some other way to do it in SQL, with the only disadvantage being verbose code?
SQL can be used to do filtering/joins etc on a massive database as compared to pandas. A good practice would be to use SQL to filter your dataset down to a manageable size before using pandas to handle the more complex operations or visualizations.
If the datasets become too huge beyond what SQL can handle, is there other options? Say Hadoop?
Yes, Hadoop is a distributed database system that can handle things larger than what SQL is capable of. In addition, it can also handle unstructured data. The benefit of SQL over Hadoop is that it can handle real-time data processing better than Hadoop.
I’ve see K-nearest neighbors and linear regression done purely in SQL. Why do people do that rather than using sklearn or Python? Is SQL particularly suited to instance-based learning algorithms?
For all learning algorithms, it is possible to do the same on any language (i.e. building from scratch). However, the benefit of doing in a certain language (e.g. Python) is that there is pre-built library (e.g. sklearn) which you can use to build the model.
When do we normalize across the row and when to normalize across the column?
It depends on the dataset and your objective. Usually you will want to normalize via column, since all data for the same feature is stored in the same column. Normalizing via row is rare, since each row is a sample. it is atypical to normalize one sample but not the rest.
After raw datasets have been preprocessed and normalized, would database normalization be performed to facilitate analyses?
Usually you will want to leave the master data untouched and normalise the data as per your objective. So you will want to access the master data, download whatever is relevant, then normalise them accordingly. To further elaborate, usually normalising will make it into a normal distribution. This will make a value of 3 to become e.g. 1.434434…, so usually you will want to leave the master data untouched. Different models will require different normalisation; for instance, if you are doing image classification with transfer learning, you need to normalise the colour channel to the specific values that are used to train the ‘pre-trained model’. And you need to be clear why you need the data to be normalised 🙂
What’s the difference between pickle and save to csv? is it faster?
Pickle is like ‘zipping’ the file, so someone else can ‘unpickle’ to ‘unzip’ the file when they are reading it.
In the unsupervised learning lesson, the exercise did Normalizer() rather than StandardScaler() to normalize across rows (companies) before Kmeans. Columns were stock price movements through 5 years. Why was it normalized by rows here? Doesn’t Kmeans process columns?
Good point. The purpose of the Normalizer() is to make all the stock prices within the same range such that a company with a stock price of e.g. $400 will not have a larger impact than a company with a price of $0.45.
I find that if I am using Colab and if I want to access my file in my local drive, I have to run some code before I can do it. It is quite troublesome. Are there simpler ways to do it?
Sadly, no; it is the syntax that Colab requires 🙂 You can refer to this site; I think it might answer your question.
Watch AMA #5 here for a quick recap.