Haakon's weblog

Adding Pandas DataFrames with Different Multiindices

I ran into a strange bug due to forgetting some subtle name differences between two data frames. The example below shows what happens. Instead of failing, result2 has three levels in its index while result1 has only two levels because the

import pandas as pd
import numpy as np

# Create the first DataFrame with a MultiIndex
index1 = pd.MultiIndex.from_tuples([('A', 1)], names=['letter', 'number'])
df1 = pd.DataFrame({'value': [1]}, index=index1)

# Create the second DataFrame with a different MultiIndex
index2 = pd.MultiIndex.from_tuples([('A', 1)], names=['letter', 'decimal'])
df2 = pd.DataFrame({'value': [4]}, index=index2)

index3 = pd.MultiIndex.from_tuples([('A', 2)], names=['letter', 'decimal'])
df3 = pd.DataFrame({'value': [4]}, index=index3)
# Add the two DataFrames together
result = df1 + df2
result2 = df1 + df3

print("DataFrame 1:")
print(df1.to_markdown())
print("Index levels:" , df1.index.names)
print("\nDataFrame 2:")
print(df2.to_markdown())
print("Index levels:" , df2.index.names)
print("\nDataFrame 3:")
print(df3.to_markdown())
print("Index levels:" , df3.index.names)
print("\nResult of adding DataFrame 1 and DataFrame 2:")
print(result.to_markdown())
print("Index levels:" , result.index.names)
print("\nResult of adding DataFrame 1 and DataFrame 3:")
print(result2.to_markdown())
print("Index levels:" , result2.index.names)
DataFrame 1:
|          |   value |
|:---------|--------:|
| ('A', 1) |       1 |

DataFrame 2:
|          |   value |
|:---------|--------:|
| ('A', 1) |       4 |

Result of adding DataFrame 1 and DataFrame 2:
|          |   value |
|:---------|--------:|
| ('A', 1) |       5 |
Index levels: ['letter', 'number']

Result of adding DataFrame 1 and DataFrame 3:
|             |   value |
|:------------|--------:|
| ('A', 1, 2) |       5 |
Index levels: ['letter', 'number', 'decimal']

So what happens it that the decimal level disappears in the first case, because it happens to have the same value as the number level in df1. However, in the second case they have different values so the decimal level is added to the index. The reason for this is that pandas doesn't care about the index names, only the values. This tripped me up a bit, but I can see that both choices make sense. Aligning indices based on values means less renaming is necessary, which makes it simpler to use and less need to worry about index names. The flip side is that you can end up with subtle bugs when a third index level is introduced.