Adding Pandas DataFrames with Different Multiindices
I ran into a strange bug due to forgetting some subtle name differences between two data frames. The example below shows what happens. Instead of failing, result2
has three levels in its index while result1
has only two levels because the
import pandas as pd import numpy as np # Create the first DataFrame with a MultiIndex index1 = pd.MultiIndex.from_tuples([('A', 1)], names=['letter', 'number']) df1 = pd.DataFrame({'value': [1]}, index=index1) # Create the second DataFrame with a different MultiIndex index2 = pd.MultiIndex.from_tuples([('A', 1)], names=['letter', 'decimal']) df2 = pd.DataFrame({'value': [4]}, index=index2) index3 = pd.MultiIndex.from_tuples([('A', 2)], names=['letter', 'decimal']) df3 = pd.DataFrame({'value': [4]}, index=index3) # Add the two DataFrames together result = df1 + df2 result2 = df1 + df3 print("DataFrame 1:") print(df1.to_markdown()) print("Index levels:" , df1.index.names) print("\nDataFrame 2:") print(df2.to_markdown()) print("Index levels:" , df2.index.names) print("\nDataFrame 3:") print(df3.to_markdown()) print("Index levels:" , df3.index.names) print("\nResult of adding DataFrame 1 and DataFrame 2:") print(result.to_markdown()) print("Index levels:" , result.index.names) print("\nResult of adding DataFrame 1 and DataFrame 3:") print(result2.to_markdown()) print("Index levels:" , result2.index.names)
DataFrame 1: | | value | |:---------|--------:| | ('A', 1) | 1 | DataFrame 2: | | value | |:---------|--------:| | ('A', 1) | 4 | Result of adding DataFrame 1 and DataFrame 2: | | value | |:---------|--------:| | ('A', 1) | 5 | Index levels: ['letter', 'number'] Result of adding DataFrame 1 and DataFrame 3: | | value | |:------------|--------:| | ('A', 1, 2) | 5 | Index levels: ['letter', 'number', 'decimal']
So what happens it that the decimal
level disappears in the first case, because it happens to have the same value as the number
level in df1
. However, in the second case they have different values so the decimal
level is added to the index. The reason for this is that pandas doesn't care about the index names, only the values. This tripped me up a bit, but I can see that both choices make sense. Aligning indices based on values means less renaming is necessary, which makes it simpler to use and less need to worry about index names. The flip side is that you can end up with subtle bugs when a third index level is introduced.