Your most burning questions from AMA #3.
What are the advantages of using pandas dataframes vs using an Excel sheet?
With pandas, you can use its built-in functions and methods, such as ‘apply’, to transform whole columns in code, which is usually much faster and easier to repeat than doing the same work with native Excel functions. You can also process data in chunks in pandas.
If you have GBs of flat file data, pandas can’t really handle that volume of data either (the machine can run out of RAM). Are there any recommendations for coping with large volumes of data, i.e., to do clustering?
You can read the file in chunks using the chunksize parameter of read_csv. You can refer to this link: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
Dask is a Python package that is meant to handle out-of-RAM data. As a first step though, it’s always easier to subsample your data and work from there. Working with out-of-RAM data is, most of the time, more trouble than it’s worth.
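As a minimal sketch of chunked reading, the example below streams a CSV a few rows at a time and keeps only running totals in memory. The in-memory buffer, the column names and the chunk size are all made up for illustration; in practice you would pass the path of your own large file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV file on disk; in practice you would
# pass a file path to read_csv instead of this in-memory buffer.
csv_data = io.StringIO(
    "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))
)

total = 0
n_rows = 0
# With chunksize set, read_csv returns an iterator of DataFrames
# (here at most 4 rows each) instead of one big DataFrame.
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += chunk["value"].sum()
    n_rows += len(chunk)

mean_value = total / n_rows

# A cheap (non-random) subsample: read only the first few rows.
csv_data.seek(0)
first_rows = pd.read_csv(csv_data, nrows=5)
```

Only one chunk is held in memory at a time, so this pattern works for aggregations that can be accumulated chunk by chunk.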
How does OAuth work in the Twitter API? What are the access token, access token secret, consumer key and consumer secret?
You need a Twitter account to get your OAuth tokens. See step 1: http://socialmedia-class.org/twittertutorial.html
As for a Python Twitter library, there are many available. The consumer key and consumer secret identify your application to the Twitter API, while the access token and access token secret identify the user account that the application acts on behalf of. Together, they authorize your access to the Twitter API.
A dataframe consists of multiple columns or Series. May I know if a Series is also a numpy array?
Not exactly. A Series is typically built on top of a numpy array, with a name and an index added. You can get the underlying array with ‘.values’.
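A small sketch of the relationship, using a made-up Series: the Series itself is not a numpy array, but the data inside it can be pulled out as one.

```python
import numpy as np
import pandas as pd

# Illustrative data; the name and index labels are arbitrary.
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="price")

arr = s.to_numpy()                     # the data as a plain numpy array
is_ndarray = isinstance(s, np.ndarray) # False: a Series wraps the array

by_label = s["b"]     # a Series can be indexed by label
by_position = arr[1]  # a numpy array only by position
```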
Sometimes, I get very confused about whether the command I need to use is a function, method or attribute. Are there any ways or hints to know which type the command is, or is it just memory work?
You can try it and, when you get an error, adjust from there.
Of course, using an attribute/method often will help you become familiar with it, and help you code faster.
You can also type:
help(whatever you want to check)
but the output might be relatively verbose. Alternatively, rely on Google or look through the online documentation.
If you don’t need to pass any parameters into () to compute something, but just want to get information, it is most likely an attribute, like df.shape; otherwise it would be a function or method. A method is a function attached to an instance of a class; a plain function is not. For example, np.mean(array) is the function and array.mean() is the corresponding method.
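One quick way to check, sketched below with made-up data, is Python's built-in callable(): it returns True for functions and methods and False for plain data attributes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
array = np.array([1.0, 2.0, 3.0])

shape_is_callable = callable(df.shape)   # False: shape is an attribute
mean_is_callable = callable(array.mean)  # True: mean is a method

function_result = np.mean(array)  # function: the array is passed in
method_result = array.mean()      # method: called on the array itself
```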
Could someone help explain how python, numpy and pandas relate to each other?
Python is a programming language; numpy is a Python library meant for scientific computing; Pandas is a higher-level library built on top of numpy for data processing.
What are some of the best practices for data cleaning and handling missing data?
It depends on the dataset, the conditions, and the approach you want to take. There are some methods you can apply, but knowing which methods to use comes with experience. A good place to start is to look for null values.
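A minimal sketch of that first step, on a made-up table: count the nulls per column, then either drop the incomplete rows or fill them in. Which option is appropriate depends on your data. The fill values here (column mean, the string "unknown") are only assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [25, np.nan, 40],
    "city": ["SG", "KL", None],
})

null_counts = df.isnull().sum()  # number of missing values per column

dropped = df.dropna()            # option 1: drop rows with any missing value

# Option 2: fill missing values, with a different rule per column.
filled = df.fillna({"age": df["age"].mean(), "city": "unknown"})
```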
How can I change the values of multiple “cells” in my pandas dataframe?
It depends on what you want to change: the values of the data or the names of the columns. For the values, you can change them with a math operation or by assigning new values to a selection. For column names, you create a list of names and use it to overwrite the existing column names.
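The three cases can be sketched as follows, on a made-up dataframe (the column names and the threshold are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Change every value in a column with a math operation.
df["a"] = df["a"] * 2

# Change only the cells that meet a condition.
df.loc[df["b"] > 15, "b"] = 0

# Rename all columns by overwriting them with a list of names.
df.columns = ["alpha", "beta"]
```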
Can I use excel-style “pivot-tables” in Pandas?
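Yes, pandas dataframes have a pivot_table method that works much like an Excel pivot table. A minimal sketch, using a made-up sales table (all column names and values are hypothetical):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "product": ["shirt", "hat", "shirt", "shirt"],
    "amount": [10, 5, 7, 3],
})

# Rows = region, columns = product, cells = sum of amount.
pivot = sales.pivot_table(
    index="region",
    columns="product",
    values="amount",
    aggfunc="sum",
    fill_value=0,  # show 0 where a region/product pair has no sales
)
```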
How do I create a dataframe from scratch?
First, you build a dictionary that maps column names to lists of values. Then you pass it to pandas’ DataFrame constructor.
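For example, with made-up names and ages:

```python
import pandas as pd

# Keys become column names; each list becomes that column's values.
data = {"name": ["Ann", "Ben"], "age": [30, 25]}

df = pd.DataFrame(data)
```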
What is loc and iloc?
loc is used when you want to refer to a row or column by its label (name), while iloc is used when you want to refer to it by its integer position.
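A short sketch on a made-up dataframe with string row labels. Note that label slices with loc include the end label, while position slices with iloc exclude the end position.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ann", "Ben", "Cal"], "age": [30, 25, 41]},
    index=["r1", "r2", "r3"],  # hypothetical row labels
)

by_label = df.loc["r2", "age"]  # row labelled "r2", column labelled "age"
by_position = df.iloc[1, 1]     # second row, second column

label_slice = df.loc["r1":"r2"]  # includes "r2"
position_slice = df.iloc[0:1]    # excludes position 1
```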
How would you know if your data is too dirty to be cleaned?
No data is too dirty to be cleaned; it is a matter of understanding the data and finding the best method to deal with it. Like the saying goes, when life gives you lemons, make lemon juice out of it. But if the lemon is not fresh (garbage data), then you can’t make fresh lemon juice!
What are some good sources of data in Singapore?
Try this Google search engine specifically built for finding datasets: https://toolbox.google.com/datasetsearch
How much melting is done in practice? I assume data would be coming from de-normalized tables in databases, so these tables are just joined and analysed directly? Is melting less applicable for data from databases but more for reports? Is melting a need arising from bad data collection/data storage design and can be completely avoided from the get-go with good planning?
In essence, melting is the reshaping of a dataset from wide to long format for easier analysis. The best way to understand it is to read the docs and practice! Data can come in many forms and will need to be presented in many other forms, depending on the story that needs to be told. For more information on how to use the melt function, see: http://www.datasciencemadesimple.com/reshape-wide-long-pandas-python-melt-function/
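As a small illustration, assume a made-up report-style table where the year is stored in the column headers. Melting turns those headers into values of a single column:

```python
import pandas as pd

# Hypothetical wide table: one column per year.
wide = pd.DataFrame({
    "city": ["A", "B"],
    "2017": [1.0, 2.0],
    "2018": [1.5, 2.5],
})

# id_vars stay as they are; every other column is stacked into
# a (year, value) pair, one row per city and year.
long = wide.melt(id_vars="city", var_name="year", value_name="value")
```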
What exactly is tidy data? I see two contradictory messages. The Hadley Wickham tidy data paper, in “Section 3.1. Column headers are values, not variable names”, says “To tidy it, we need to melt, or stack it. In other words, we need to turn columns into rows”. However, the confirmation message of the “Recognizing tidy data” exercise in Cleaning Data in Python (2. Tidying data for analysis) says: “Notice that the variable column of df2 contains the values Solar.R, Ozone, Temp, and Wind. For it to be tidy, these should all be in separate columns, as in df1”, which suggests that unmelting and turning rows into columns is tidy. Also, the video introduces analysis-friendly vs report-friendly shapes. What is the relationship between the 3 dichotomies: analysis/report-friendly shape, melted/pivoted, and tidy/untidy?
It might help to think of these attributes not as dichotomies but as a spectrum. Some data could be presented in a “messy” fashion for the purpose of efficient storage or efficient computation (think of tables generated using groupby or agg functions), but remain extremely useful within their narrow purpose. Of course, to use this data for a different purpose, a great deal of effort will be needed to unpack the data. So, not all messy data is bad.
Tidy data, however, is a standard way of mapping the meaning of a dataset to its structure. In tidy data: 1) each variable forms a column; 2) each observation forms a row; 3) each type of observational unit forms a table. Those proficient in SQL will recognize that these are basically the rules of third normal form (3NF).
Melting and pivoting are simply operations you apply to datasets to make them ‘tidier’ or ‘messier’. Which one tidies the data depends on where it starts: if column headers hold values, you melt; if a single column holds the names of several variables, you pivot.
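To make the second case from the question concrete, here is a sketch loosely modelled on the df1/df2 example: a ‘variable’ column holds the names of several variables, so pivoting is what gives each variable its own column. The numbers are illustrative only, and only two of the variables are shown.

```python
import pandas as pd

# Long table: the "variable" column mixes two different variables.
df2 = pd.DataFrame({
    "day": [1, 1, 2, 2],
    "variable": ["Ozone", "Temp", "Ozone", "Temp"],
    "value": [41.0, 67.0, 36.0, 72.0],  # illustrative values
})

# Pivot: each distinct entry in "variable" becomes its own column.
df1 = df2.pivot(index="day", columns="variable", values="value").reset_index()
df1.columns.name = None  # drop the leftover "variable" axis label

# Melting df1 goes back to the long shape.
back = df1.melt(id_vars="day", var_name="variable", value_name="value")
```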
Any resources to learn more regex? I felt the interactive https://regexone.com/ was good but zero-length matches and \b are still hard to understand. Any rules of thumb to write good regex? Should * not be used at all because it can give 0 length matches? (false positive matches).
There are many great resources online; the important thing is to practice until you hit the proficiency level you desire.
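On the two points raised in the question, a small experiment with Python's re module may help. This is only a sketch with made-up strings: \b matches the position between a word character and a non-word character, and * can match zero characters, which is where empty matches come from. Using + instead requires at least one character.

```python
import re

text = "cat concat catalog cat."

# \b is a zero-length "word boundary": it matches a position, not a character.
whole_words = re.findall(r"\bcat\b", text)  # only the standalone "cat"s
substrings = re.findall(r"cat", text)       # every occurrence, even inside words

# "a*" means zero or more "a", so it also matches the empty string.
star_matches = re.findall(r"a*", "baab")
# "a+" means one or more "a", so it never produces an empty match.
plus_matches = re.findall(r"a+", "baab")
```

So * is not wrong in itself; it becomes a problem mainly when the whole pattern is allowed to match nothing.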
The data cleaning, df.apply lesson showed regex as a data validation tool. In practice, do people use regex together with asserts/exceptions as an automated alerting/input data quality monitoring tool? What if there are patterns in future data that we want to match but don’t know of at the moment of designing the regex pattern for this system? Are there design philosophies that could maximize the flexibility of a regex pattern, or its ability to react to environmental changes? Or is regex a very hardcoded thing that has to be manually updated every time?
For simple data validation, regex can be a quick and easy tool to use. However, we have to keep in mind that regex code can become very complex, and updating a complex regex pattern can become very taxing for a new person trying to update the codebase (or even yourself in the future). For more complex requirements, it might be better to use a parser instead of hard-coded regex.
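A minimal sketch of regex plus an assert as a data quality check. The six-digit rule, the find_invalid helper and the sample values are all hypothetical. Keeping the pattern in one named constant means there is a single place to update when the rules change.

```python
import re
import pandas as pd

# Assumed rule for illustration: a valid code is exactly six digits.
POSTAL_CODE = re.compile(r"\d{6}")

def find_invalid(series, pattern):
    """Return the values that do not fully match the pattern."""
    is_valid = series.astype(str).map(
        lambda value: pattern.fullmatch(value) is not None
    )
    return series[~is_valid]

codes = pd.Series(["123456", "654321", "12A456"])  # made-up input
invalid = find_invalid(codes, POSTAL_CODE)

try:
    # The assert fails loudly as soon as bad data shows up.
    assert invalid.empty, f"{len(invalid)} invalid value(s): {invalid.tolist()}"
    alert = None
except AssertionError as err:
    alert = str(err)  # in a real pipeline this could be logged or sent on
```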
Are regex generators reliable/helpful or should I always write my own regex?
Generally, it’s always better to know how to write your own regex, but for convenience, you can always use your favourite regex generator.
The lesson on duplicates says they use up unnecessary amounts of memory and bias any analysis results. Then why do people do bootstrap resampling to increase the number of observations of the less frequently occurring class when building classification models? Also, isn’t anomaly detection full of duplicate data (variables contain the same values through each time point until something happens)? What is the advantage of having unique data rows? I somehow feel it’s related to the model chosen.
If you have a large class imbalance (e.g., 90/10), it becomes very difficult to build a prediction model, because the model would already score 90% accuracy simply by predicting the 90% class every single time!
Two common ways to address this class imbalance are to reduce the number of observations in the 90% group, or to increase the number of observations in the 10% group, so that the two classes are closer to a 1:1 ratio in the dataset. In a large enough dataset, reducing the observations could work, but if your dataset is small, the more practical option might be to increase the observations in the 10% group. This is where bootstrap resampling comes in: it is a statistical technique that samples existing observations with replacement, which lets us increase the size of the minority group without inventing new values. The duplicates it creates are deliberate, unlike accidental duplicates in raw data.
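A minimal sketch of upsampling the minority class by sampling with replacement, on a made-up 90/10 dataset. The column names and the random seed are arbitrary, and this is one simple way to do it rather than the only one.

```python
import pandas as pd

# Hypothetical imbalanced data: nine rows of class 0, one row of class 1.
df = pd.DataFrame({"feature": range(10), "label": [0] * 9 + [1]})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Draw from the minority rows, with replacement, until both classes match.
minority_upsampled = minority.sample(
    n=len(majority), replace=True, random_state=0
)

balanced = pd.concat([majority, minority_upsampled], ignore_index=True)
counts = balanced["label"].value_counts()
```

In a modelling workflow, this kind of resampling is usually applied to the training data only, so that the evaluation data keeps the original class ratio.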
With regard to duplicate data, removing truly duplicate data is a good practice (e.g., you don’t want to conduct an analysis on customer data saying that Peter came into the store twice and bought 1 shirt each time, when he only came in once). However, duplicate-looking data can also be genuinely distinct (Peter came into the store and bought 1 shirt, 2 days in a row).
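The Peter example can be sketched as follows, with made-up dates. Whether a row counts as a duplicate depends on which columns you compare.

```python
import pandas as pd

# Hypothetical visit log: the second row is an accidental repeat of the
# first, while the third row is a genuine visit on the next day.
visits = pd.DataFrame({
    "customer": ["Peter", "Peter", "Peter"],
    "date": ["2018-01-01", "2018-01-01", "2018-01-02"],
    "item": ["shirt", "shirt", "shirt"],
})

exact_dupes = visits.duplicated()  # True only for fully identical repeats
deduped = visits.drop_duplicates() # keeps one copy of each distinct row

# Without the date column, the genuine second visit also looks like a duplicate.
looks_duplicated = visits[["customer", "item"]].duplicated().sum()
```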