The power of the pandas transform method
Statistical learning methods have evolved alongside the power of programming languages. People can now apply these learning models more conveniently with the help of built-in algorithms and interfaces (Tableau, Power BI, Cloudera, etc.). However, the emerging bottleneck may be the data wrangling and cleaning phase of data science.
Forbes: Data scientists found that they spend most of their time massaging rather than mining or modeling data.
In fact, there is also a difference between data engineers and data scientists. Not all data engineers need to be good at data science, but all data scientists should be good at data engineering.
I have been interested in statistical learning since 2013, and I feel that data wrangling and cleaning is the most likely bottleneck in data analysis.
Therefore, I decided to write about the pandas groupby-transform method in detail, since I use it for most of my complex problems. Let's start…
The transform method is a method of the groupby operation. A simple groupby operation is:
Group = df.groupby(["A", "B"])
This is only a groupby object. To obtain output, we need to call an aggregation method on it.
Group["C"].size() (min(), max(), mean() are also available)
This gives us the number of rows in each group for column "C". Let's talk about the .transform operation.
The .transform method takes a function that returns an object with the same index (and the same number of rows) as was passed into it. Because it has the same index, we can insert it as a column. The .transform method is useful for summarizing information from the groups and then adding it back to the original DataFrame.
In fact, the most important idea of the .transform method is that it takes a Series and returns a Series of the same size as that group.
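To make the difference between aggregating and transforming concrete, here is a minimal sketch. The data is made up for illustration; only the column names A, B and C come from the example above. The aggregation returns one value per group, while transform returns one value per original row, so it can be assigned straight back as a column.

```python
import pandas as pd

# Hypothetical data; only the column names A, B, C come from the text above.
df = pd.DataFrame({
    "A": ["x", "x", "x", "y", "y"],
    "B": ["p", "p", "q", "p", "p"],
    "C": [1, 3, 5, 10, 20],
})

group = df.groupby(["A", "B"])

# Aggregation: one value per group, indexed by the group keys.
group_means = group["C"].mean()

# Transform: one value per original row, indexed like df,
# so it can be inserted as a new column.
df["C_group_mean"] = group["C"].transform("mean")

print(group_means)
print(df)
```

Every row now carries the mean of its own (A, B) group, which is handy for things like centering values within a group.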
def percent_loss(s): return ((s - s.iloc[0]) / s.iloc[0]) * 100
As can be seen above, the percent_loss function takes a Series. The return value is also a Series of the same size.
The importance of .transform is that you can manipulate each group within its own scope. There is no need to process the groups one by one.
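As a rough illustration of percent_loss in use, the sketch below applies it per group. The Name and Weight columns and their values are assumptions made up for this example; they are not from a real dataset.

```python
import pandas as pd

def percent_loss(s):
    # Percent change of each value relative to the first value in its group.
    return ((s - s.iloc[0]) / s.iloc[0]) * 100

# Hypothetical measurements for two people, in chronological order.
df = pd.DataFrame({
    "Name": ["Bob", "Bob", "Bob", "Amy", "Amy"],
    "Weight": [200, 190, 180, 150, 147],
})

# Each person's Weight values are passed to percent_loss as one Series,
# and the results are stitched back together in the original row order.
df["Pct_change"] = df.groupby("Name")["Weight"].transform(percent_loss)

print(df)
```

Each person's first row comes out as zero, and the later rows are measured against that person's own starting value, not against anyone else's.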
Let's dive deeper into the transform method.
Group = df.groupby("Name")
Group.transform(lambda x: x + 1)
At first you might think that the function is called once per element, but it isn't. Suppose that, besides Name, the DataFrame has one other column, and that its values are [2, 3, 5, 8] for the group called "Bob".
The .transform method receives these values as a Series and computes [2, 3, 5, 8] + 1 in a single vectorized step. There is no element-by-element iteration. This is the tricky part of the .transform method.
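One way to check this yourself is to record what pandas actually passes to the function. The sketch below assumes a made-up column called Value and selects it before calling transform, so the function is handed one Series per group.

```python
import pandas as pd

# Hypothetical data: Bob's values match the example in the text.
df = pd.DataFrame({
    "Name": ["Bob", "Bob", "Bob", "Bob", "Amy", "Amy"],
    "Value": [2, 3, 5, 8, 1, 4],
})

seen_types = []

def add_one(x):
    # Record the type of the argument: a whole Series, not a single number.
    seen_types.append(type(x).__name__)
    return x + 1

result = df.groupby("Name")["Value"].transform(add_one)

print(seen_types)
print(result.tolist())
```

The recorded type names are all Series, never int, which confirms that x + 1 is ordinary vectorized Series arithmetic applied to each group as a whole.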
See you in the next post.