AMA #4 Questions & Answers Part 1.

Your most burning questions from AMA #4.

Just backtracking to the chapter on importing data from websites using BeautifulSoup. What if the site uses JavaScript and the data is not immediately available in the HTML source of the page? Are there any Python packages that deal with this situation? What are your experiences?

From experience, it can be a tricky thing. Make sure to fully read the site’s terms of use and privacy policy before you do any crawling to prevent misuse of data or misunderstandings. If the site uses JavaScript but has not explicitly disallowed usage of its data, consider writing to the site’s owner for access to the data: it’s both polite and legal!

Check out this link for more information:

I noticed from the videos that there are two ways of referring to a column: either df["colname"] or df.colname. Is there any situation where only one of the two ways works? What if the column name contains a dot '.'?

Yes, exactly! df.colname only works if the column name is a valid Python identifier, so it cannot contain special characters like '.' or spaces, and it must not clash with an existing DataFrame attribute or method name. If the column names contain special characters, you should use df['col name']. It’s good practice to use df["colname"] everywhere to minimize confusion and increase readability.
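The difference between the two access styles can be sketched as follows. The column names are made up for illustration and are not from the course material.

```python
import pandas as pd

# Hypothetical column names chosen to show where attribute access breaks down.
df = pd.DataFrame({
    "price": [10, 20],
    "unit.cost": [4, 9],   # contains a dot
    "order id": [1, 2],    # contains a space
    "count": [3, 5],       # same name as the DataFrame.count method
})

# Attribute access works for a plain identifier and returns the same data.
same = df.price.equals(df["price"])

# Bracket access works for every column name.
unit_cost = df["unit.cost"]
order_id = df["order id"]

# df.count resolves to the DataFrame method, not the column,
# so brackets are needed to reach the column.
count_is_method = callable(df.count)
count_column = df["count"]
```

Writing df.unit.cost here would raise an AttributeError, because pandas looks for a column called "unit" first.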

When using pd.to_numeric() to cast data, is there any way to choose whether to cast to float or int type? Sometimes pandas may not make the best choice.

pd.to_numeric() should be used to convert from object (string) to a numeric type. Its downcast argument lets you ask for the smallest integer or float type that fits the data. To cast from float to integer, use the astype() method.
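A minimal sketch of the two functions side by side, using small made-up Series:

```python
import pandas as pd

raw = pd.Series(["1", "2", "3"])  # numbers stored as strings

as_numbers = pd.to_numeric(raw)                     # pandas infers int64 here
as_float = as_numbers.astype("float64")             # explicit cast to float
small_int = pd.to_numeric(raw, downcast="integer")  # smallest integer dtype that fits

# errors="coerce" turns values that cannot be parsed into NaN.
mixed = pd.Series(["1.5", "2", "oops"])
coerced = pd.to_numeric(mixed, errors="coerce")

# Float to integer with astype(); the fractional part is truncated.
truncated = pd.Series([1.9, 2.1]).astype("int64")
```

So to_numeric() decides the type from the data, while astype() lets you state the target type yourself.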

When should I use Pandas?

As much as you can! Pandas provides you with a comprehensive set of functions to perform complex data manipulation. Data in Pandas dataframes is also more easily manipulated.

Can you help to summarize merge and join, and give tips on how to keep track of left_on, right_on, etc.?

Take a look at the link below for a comprehensive answer:
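Separately from the linked answer, here is a rough sketch of how the parameters in the question behave. The two tables and their column names are hypothetical.

```python
import pandas as pd

# Hypothetical tables; the key column has a different name in each.
orders = pd.DataFrame({"cust_id": [1, 2, 4], "amount": [50, 20, 70]})
customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})

# merge: left_on / right_on name the key column in the left and right frame.
inner = pd.merge(orders, customers, left_on="cust_id", right_on="id", how="inner")

# how="left" keeps every row of the left frame; unmatched rows get NaN.
left = pd.merge(orders, customers, left_on="cust_id", right_on="id", how="left")

# join: matches against the index of the other frame (a left join by default).
joined = orders.join(customers.set_index("id"), on="cust_id")
```

One way to remember it: "left" is always the frame you call the method on or pass first, and "right" is the other one.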

Do the “categories” need to be arranged in the right sequence? Otherwise, the plot’s legend is shown out of order.

You can sort your legend separately from the order of the categories in the dataframe.

Why are some data plots in log form?

In a plot of multiple elements (e.g. stock prices vs. time), an outlying high price will pull the axis towards itself and squeeze all the low prices together, so you can’t see the prices varying around 10-20 when one stock at 1000 makes the y-axis really big. Also, taking the log is a common pre-processing step in time-series analysis: it reduces the variance of the series and helps it get closer to stationarity before applying ARIMA models.
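The squeezing effect can be seen in numbers without drawing anything. The prices below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical prices: two cheap stocks and one expensive one.
prices = pd.DataFrame({
    "A": [10.0, 12.0, 11.0, 15.0],
    "B": [18.0, 20.0, 19.0, 17.0],
    "C": [1000.0, 1100.0, 950.0, 1200.0],
})

log_prices = np.log10(prices)

# Range of values that would have to share one y-axis.
linear_span = prices.max().max() - prices.min().min()
log_span = log_prices.max().max() - log_prices.min().min()

# Share of the y-axis that stock A's own movement occupies.
a_share_linear = (prices["A"].max() - prices["A"].min()) / linear_span
a_share_log = (log_prices["A"].max() - log_prices["A"].min()) / log_span
```

On the linear axis, stock A's movement takes up well under 1% of the height, while on the log axis it takes up several percent. With matplotlib installed, prices.plot(logy=True) draws the same data on a log axis.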

In Pandas Foundations, the lesson seems to be using reindex to filter rows. Why use df.reindex rather than boolean indexing to select the desired rows/columns? Is it to use the side effect of generating entire rows/columns of NaN to be populated with values later?

Yes, that is one of the uses.
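A small sketch of the difference, using a made-up temperature table:

```python
import pandas as pd

df = pd.DataFrame({"temp": [20, 22, 19]}, index=["Mon", "Tue", "Thu"])

# Boolean indexing can only return rows that already exist.
warm = df[df["temp"] > 19]

# reindex conforms the frame to the labels given: existing rows are kept,
# and the missing label ("Wed") appears as a row of NaN.
week = df.reindex(["Mon", "Tue", "Wed", "Thu"])

# The NaN row can then be populated, for example by forward-filling.
filled = week.ffill()
```

reindex is also handy for putting rows in a specific order or aligning one dataframe to the index of another.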

So far, the examples demonstrated in the AMAs are read in csv files. Will you be able to demonstrate an example of reading from PostgreSQL and Hadoop?

You can use Pandas’ read_sql function for PostgreSQL. For Hadoop, you can use native Python libraries, such as hdfs, to process the data. All you have to do is build a dict from the data and create a dataframe from that dictionary object using pd.DataFrame().
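Here is a rough sketch of both routes. To keep it runnable without a server, it uses an in-memory SQLite database as a stand-in for PostgreSQL; that swap, the table and the values are all assumptions made for illustration. For a real PostgreSQL database you would typically pass a SQLAlchemy engine as the connection instead.

```python
import sqlite3
import pandas as pd

# Stand-in database: an in-memory SQLite connection (assumption, not PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("north", 10), ("south", 25)])
conn.commit()

# read_sql runs the query and returns the result as a dataframe.
from_sql = pd.read_sql("SELECT region, amount FROM sales ORDER BY amount", conn)
conn.close()

# Data fetched by any other client (for example from Hadoop) can be collected
# into a dict of columns and passed to pd.DataFrame().
records = {"region": ["east", "west"], "amount": [7, 12]}
from_dict = pd.DataFrame(records)
```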

AMA #4 Q&A Part 2 here.
Watch AMA #4 here for a quick recap. 
