Generating Targeted Insights
Once we have identified the KPIs for our analysis, we can restrict our attention to generating insights from only those variables that directly affect those KPIs.
Selecting and Renaming Attributes
After we have explored our attributes, we might find that the variation in the data for a certain attribute would be easier to understand if we focused on it individually. As explained in detail in the previous chapter, we can select parts of data in pandas through the following methods:
- [cols]: This method selects the columns to be displayed.
- loc[label]: This method selects rows by label or Boolean condition.
- loc[row_labels, cols]: This method selects rows in row_labels and their values in the cols columns.
- iloc[location]: This method selects rows by integer location. It can be used to pass a list of row indices, slices, and so on.
For example, we can select the Revenue, Quantity, and Gross profit columns for the United States from the sales DataFrame, as follows:
sales.loc[sales['Retailer country']=='United States', ['Revenue', 'Quantity', 'Gross profit']].head()
This should get you the following output:
Figure 2.24: Sub-selecting observations and attributes in pandas
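The other selection methods work in the same way. As a quick sketch on the same sales DataFrame (using the column names visible above, so treat it as illustrative):
sales[['Revenue', 'Quantity']].head()  # [cols]: select columns only
sales.iloc[0:5]                        # iloc with a slice: the first five rows by position
sales.iloc[[0, 2, 4], 0:3]             # iloc: rows 0, 2, and 4 with the first three columns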
Sometimes, even after selecting the right data, the insight we want to deliver is still not captured in the form we want. For example, an attribute might be named differently from what we expect, making the analysis harder to interpret. We can rename columns and indexes in pandas by using the rename function.
The rename function takes a dictionary as input, with the current attribute name as the key and the desired new name as the value. It also takes an axis parameter, which denotes whether the index or the columns should be renamed; the index is the default. The following code renames the Revenue column to Earnings:
sales.rename({'Revenue':'Earnings'}, axis = 'columns').head()
The DataFrame will now appear as follows:
Figure 2.25: Output after using the rename function on sales
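Note that rename, like most pandas methods, returns a modified copy and leaves sales itself unchanged. If you want to keep the new name, assign the result to a variable (sales_renamed below is a hypothetical name used only for this sketch):
sales_renamed = sales.rename({'Revenue':'Earnings'}, axis = 'columns')  # sales itself is left untouched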
Transforming Values
Even after sub-selecting the right features in the DataFrame and renaming them, we can sometimes still get lost in unnecessary details, such as floating-point precision, instead of looking at the bigger picture. To avoid that, we can transform numerical data into categorical data, and vice versa, to make the data easier to interpret.
For example, consider the Unit cost field in sales.csv and check the spread of data in that column:
sales['Unit cost'].quantile([0.0, 0.25,0.5,0.75,1])
On using the above code, we get the following output:
Figure 2.26: The default spread in columns across the Unit cost field
The astype() function provides an API to coerce a pandas DataFrame column to a certain type:
sales['Unit cost'] = sales['Unit cost'].astype('category')
sales.dtypes
The above snippet results in the following output:
Figure 2.27: Changing the datatype of a pandas column by coercion
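The reverse coercion, from categorical back to numerical, follows the same pattern. Since the category labels here are themselves numeric, astype can take the column straight back to float; this sketch leaves sales unchanged:
sales['Unit cost'].astype('float').head()  # categories -> plain floating-point values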
We can also replace the actual values of the data through a custom transformation, again using the map and apply functions on the DataFrame to encode values as a different type of data. For instance, we can write a custom function in Python to transform the numeric Unit cost attribute into a categorical column:
def cat_gen(x):
    # bin a unit cost into one of four price bands, keeping missing values missing
    if pd.isnull(x):
        return np.nan
    elif x <= 2.76:
        return "cheap"
    elif 2.76 < x <= 9.0:
        return "medium"
    elif 9.0 < x <= 34.97:
        return "moderate"
    else:
        return "expensive"
sales['Cost category'] = sales['Unit cost'].map(cat_gen)
sales['Cost category'].value_counts(dropna = True)
This will give us the following output:
Figure 2.28: Converting numerical data to categorical data using custom transformations
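As an aside, the same binning could be sketched with the built-in pd.cut function instead of a handwritten mapper. The bin edges below are simply the cut-offs used in cat_gen, and astype('float') guards against the earlier coercion to category:
import numpy as np
import pandas as pd
# pd.cut bins values into right-closed intervals, mirroring cat_gen's conditions
bins = [-np.inf, 2.76, 9.0, 34.97, np.inf]
labels = ['cheap', 'medium', 'moderate', 'expensive']
sales['Cost category'] = pd.cut(sales['Unit cost'].astype('float'), bins = bins, labels = labels)
sales['Cost category'].value_counts()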
Exercise 7: Targeting Insights for Specific Use Cases
A newspaper company has given you their rates for putting up advertisements in cities P, Q, and R. You need to compare the advertisement rates across the cities, figure out the cost of displaying advertisements on every day of the week in all three cities, and find the mean advertisement price for each day across those three cities. Let's revisit the methods we used in the previous chapter and use them to solve this problem:
- Read the newspaper_prices.csv CSV file and look at it:
df = pd.read_csv('newspaper_prices.csv', index_col = 'Day')
df
You should get the following output:
Figure 2.29: The first few rows of newspaper_prices.csv
- Apply the sum function on the DataFrame to get the total of each column across all the days:
df.sum()
You should get the following output:
Figure 2.30: The sum of values in each column
You will observe that Q provides the highest number of views but is also the most expensive.
- Now select the columns we want and apply the mean function to them:
df[['P Cost', 'Q Cost', 'R Cost']].mean(axis = 1)
This should give you the following output:
Figure 2.31: Applying a function on a different axis
As we have taken the mean across the cities, we have obtained the average advertisement cost for each day. We can infer that the average cost of advertisements on Fridays, Saturdays, and Sundays is higher than on Mondays, Tuesdays, and Wednesdays.
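For contrast, leaving axis at its default of 0 would average each column over the days instead, giving the mean cost per city rather than per day (a quick sketch):
df[['P Cost', 'Q Cost', 'R Cost']].mean()  # axis = 0 (default): one mean per city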
Reshaping the Data
Other than just focusing on the relevant parts of the data, we can also extract a lot more information from a dataset by changing how its attributes and observations are arranged. Reshaped versions of a dataset use different indices, fields, and so on for the same data; by analyzing them, we can understand the relationships in our data clearly and easily see which values occur together in the same location. Let's consider the data in CTA_comparison.csv in the Lesson02 folder, stored in a DataFrame called cta, as follows:
cta = pd.read_csv('CTA_comparison.csv')
cta
The DataFrame should appear as follows:
Figure 2.32: The entire data in CTA_comparison.csv
We can reshape a DataFrame simply by changing its index with the set_index function in pandas. For example, we can set the index of the CTA_comparison.csv data to CTA Variant using the following code:
cta.set_index('CTA Variant')
The DataFrame will now appear as follows:
Figure 2.33: Changing the index with the help of set_index
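As with rename, set_index returns a new DataFrame rather than modifying cta in place. To keep the reshaped version around, assign it to a name (cta_indexed below is a hypothetical name; the examples that follow keep working on the original cta):
cta_indexed = cta.set_index('CTA Variant')  # cta itself is left untouched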
You can also reshape data by creating a hierarchy. This can be done easily in pandas by passing multiple columns to the set_index function. For instance, we can set CTA Variant and views as the index of the DataFrame as follows:
cta.set_index(['CTA Variant', 'views'])
The DataFrame will now appear as follows:
Figure 2.34: Hierarchical Data in pandas through set_index
The same hierarchy can also be created more clearly by passing multiple columns to the groupby function:
cta_views = cta.groupby(['CTA Variant', 'views']).count()
cta_views
This gives the following output:
Figure 2.35: Grouping by multiple columns to generate hierarchies
Using this hierarchy, we can easily spot groups that occur only in some scenarios and not in others. For instance, CTA Variant B gets 800 views in all three cases, whereas variants A and C each get only two kinds of views.
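A hierarchical index also makes it easy to drill into a single group by passing a tuple of index values to loc. Assuming the ('B', 800) combination shown in Figure 2.35, a sketch looks like this:
cta_views.loc[('B', 800)]  # counts for variant B with 800 views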
It is possible to reshape this dataset even further: we can move indices from the rows to the columns and vice versa. This basic reshape transformation is achieved in pandas using the unstack and stack functions, explained here:
- unstack(level): This function moves the row index with the given name or integer position (level) to the innermost column index. By default, it moves the innermost row index:
h1 = cta_views.unstack(level = 'CTA Variant')
h1
This gives the following output:
Figure 2.36: Example of unstacking DataFrames
We can see that the row index now contains only views, while the columns have gained CTA Variant as an additional index level alongside the regular time and sales columns.
- stack(level): This function moves the column index with the given name or integer position (level) to the innermost row index. By default, it moves the innermost column index:
h1.stack(0)
This gives the following output:
Figure 2.37: Example of stacking a DataFrame
Now the stack function has moved the sales and time column labels into the row index, and only the CTA Variant feature remains as the column index.
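To see that the two functions are inverses of each other, here is a small self-contained sketch on toy data (the names are hypothetical), unstacking a level by name and then stacking it back:
import pandas as pd
# a toy hierarchical DataFrame: (variant, views) row index with one data column
idx = pd.MultiIndex.from_product([['A', 'B'], [800, 1600]], names = ['variant', 'views'])
toy = pd.DataFrame({'sales': [1, 2, 3, 4]}, index = idx)
wide = toy.unstack(level = 'variant')  # 'variant' moves to the columns
back = wide.stack(level = 'variant')   # ...and back to the rows
# back holds the same data as toy; only the order of the index levels differs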
Exercise 8: Understanding Stacking and Unstacking
You are the owner of a website that randomly shows advertisement A or B to users each time a page is loaded. If an advertisement succeeds in getting a user to click on it, the converted field gets the value 1; otherwise, it gets 0. The data for this is present in the conversion_rates.csv file in the Lesson02 folder, and you need to use pandas to find relationships, if any, between the variables in the dataset. Create a DataFrame, df, in which the conversion ratio of advertisement A can be accessed as df['A']['conversion_ratio'].
Your visualization code requires the number of advertisements viewed, converted, and the conversion_ratio to be in the row indices and both the variants to be in the columns. Create a DataFrame to work in that scenario:
- Import pandas into the console and read the conversion_rates.csv file into a pandas DataFrame called data, as shown here:
import pandas as pd
data = pd.read_csv('conversion_rates.csv')
- Look at the first few rows of the DataFrame using the head method:
data.head()
You should get the following output:
Figure 2.38: The first few rows of conversion_rates.csv
- Group the data by group and sum the converted column; since converted is either 0 or 1, the sum counts the conversions. Store the result in a DataFrame named converted:
converted = data.groupby('group').sum()
converted
You will get the following output:
Figure 2.39: Count of converted displays
- Group the data and count the number of times each advertisement was displayed. Store this in a DataFrame named viewed and rename the column name to viewed:
viewed = data.groupby('group').count().rename({'converted':'viewed'}, axis = 'columns')
viewed
You will get the following output:
Figure 2.40: Count of number of views
- Combine the converted and viewed datasets in a new DataFrame, named stats, as shown here:
stats = converted.merge(viewed, on = 'group')
stats
This gives the following output:
Figure 2.41: Combined dataset
- Create a new column called conversion_ratio to get the ratio of converted ads to the number of times the ads were displayed. Because the previous step renamed the count column to viewed, the merged DataFrame has distinct converted and viewed columns:
stats['conversion_ratio'] = stats['converted']/stats['viewed']
stats
This gives the following output:
Figure 2.42: Adding an additional column to stats
- Create a DataFrame where group A's conversion ratio is accessed as df['A']['conversion_ratio']. Use the stack function for this operation:
df = stats.stack()
df
This gives the following output:
Figure 2.43: Understanding the different levels of your dataset
- Check whether you're able to access the desired value using the following code:
df['A']['conversion_ratio']
You should get a value close to 0.08737.
- Swap the rows with the columns of the stats DataFrame by applying the unstack() function twice:
stats.unstack().unstack()
This gives the following output:
Figure 2.44: Reversing rows with columns
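Incidentally, because stats has a single-level index, unstacking twice amounts to swapping the rows with the columns, so the same table could also be sketched as a plain transpose:
stats.T  # rows become columns and vice versa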
Congratulations! You have reshaped the data in the desired manner, and you can now bring any data into a format you like. pandas also provides a simpler way to reshape data that makes comparisons during analysis very easy. Let's have a look at it in the next section.
Pivot Tables
Creating a pivot table is a special case of reshaping a DataFrame. The pivot function, which is used to create a pivot table, takes three arguments (index, columns, and values) and creates a new table whose row and column indices are the unique values of the respective parameters.
For example, consider the same data DataFrame used in the previous exercise:
Figure 2.45: The first few rows of the dataset being considered
We can use the pivot function to turn the unique values in group into columns and see whether each observation (index) was converted or not, as follows:
data.pivot(columns = 'group', values='converted').head()
This gives the following output:
Figure 2.46: Data after being passed through the pivot command
In the preceding figure, note that the columns and indices have changed but the individual observations have not. The data that had either a 0 or 1 value remains as it is, while each observation's entry under the group it did not belong to is filled in as missing.
There is also a function called pivot_table, which aggregates fields together using the function specified in the aggfunc parameter and creates its output accordingly. It can be considered an alternative to aggregations such as groupby functions.
For instance, let's apply the pivot_table function to the same DataFrame to aggregate data:
data.pivot_table(index='group', columns='converted', aggfunc=len)
This gives the following output:
Figure 2.47: Applying pivot_table to data
Note that the use of the len argument results in columns 0 and 1 that show how many times each of these values appeared in each group.
Remember that, unlike pivot, pivot_table aggregates duplicate entries with the aggfunc function, which defaults to taking the mean, so it is best to pass aggfunc explicitly when using pivot_table.
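Building on this, a different aggfunc can compute summaries directly. For instance, taking the mean of the 0/1 converted column per group should reproduce the conversion ratios from Exercise 8 (a hedged sketch):
data.pivot_table(index = 'group', values = 'converted', aggfunc = 'mean')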