merge与concat的区别在于,merge需要依据某一共同的行或列来进行合并
使用pd.merge()合并时,会自动根据两者相同column名称的那一列,作为key来进行合并。
注意每一列元素的顺序不要求一致
1.一对一合并
import numpy as np import pandas as pd from pandas import Series,DataFrame df1 = DataFrame({'employee':['Bob','Jake','Lisa'], 'group':['Accounting','Engineering','Engineering'], }) df2 = DataFrame({'employee':['Lisa','Bob','Jake'], 'hire_date':[2004,2008,2012], }) display(df1,df2) 输出: employee group 0 Bob Accounting 1 Jake Engineering 2 Lisa Engineering employee hire_date 0 Lisa 2004 1 Bob 2008 2 Jake 2012 df3 = pd.merge(df1,df2) df3 输出: employee group hire_date 0 Bob Accounting 2008 1 Jake Engineering 2012 2 Lisa Engineering 2004
多对一合并
df3 = DataFrame({ 'employee':['Lisa','Jake'], 'group':['Accounting','Engineering'], 'hire_date':[2004,2016]}) df4 = DataFrame({'group':['Accounting','Engineering','Engineering'], 'supervisor':['Carly','Guido','Steve'] }) display(df3,df4,pd.merge(df3,df4)) 输出: employee group hire_date 0 Lisa Accounting 2004 2 Jake Engineering 2016 group supervisor 0 Accounting Carly 1 Engineering Guido 2 Engineering Steve employee group hire_date supervisor 0 Lisa Accounting 2004 Carly 1 Jake Engineering 2016 Guido 2 Jake Engineering 2016 Steve
多对多合并
df1 = DataFrame({'employee':['Bob','Jake','Lisa'], 'group':['Accounting','Engineering','Engineering']}) df5 = DataFrame({'group':['Engineering','Engineering','HR'], 'supervisor':['Carly','Guido','Steve'] }) display(df1,df5,pd.merge(df1,df5))#多对多 display(pd.concat([df1,df5])) 输出: employee group 0 Bob Accounting 1 Jake Engineering 2 Lisa Engineering group supervisor 0 Engineering Carly 1 Engineering Guido 2 HR Steve employee group supervisor 0 Jake Engineering Carly 1 Jake Engineering Guido 2 Lisa Engineering Carly 3 Lisa Engineering Guido employee group supervisor 0 Bob Accounting NaN 1 Jake Engineering NaN 2 Lisa Engineering NaN 0 NaN Engineering Carly 1 NaN Engineering Guido 2 NaN HR Steve
(1)key的规范化
使用on=显式指定哪一列为key
df1 = DataFrame({'employee':['Jack',"Summer","Steve"], 'group':['Accounting','Finance','Marketing']}) df2 = DataFrame({'employee':['Jack','Bob',"Jake"], 'hire_date':[2003,2009,2012], 'group':['Accounting','sell','ceo']}) display(df1,df2,pd.merge(df1,df2),pd.merge(df1,df2,on = 'employee')) 输出: employee group 0 Jack Accounting 1 Summer Finance 2 Steve Marketing employee group hire_date 0 Jack Accounting 2003 1 Bob sell 2009 2 Jake ceo 2012 employee group hire_date 0 Jack Accounting 2003 employee group_x group_y hire_date 0 Jack Accounting Accounting 2003
使用left_on和right_on指定左右两边的列作为key
df1 = DataFrame({'employee':['Bobs','Linda','Bill'], 'group':['Accounting','Product','Marketing'], 'hire_date':[1998,2017,2018]}) df5 = DataFrame({'name':['Lisa','Bobs','Bill'], 'hire_dates':[1998,2016,2007]}) display(df1,df5,pd.merge(df1,df5,left_on = 'employee',right_on = 'name')) 输出: employee group hire_date 0 Bobs Accounting 1998 1 Linda Product 2017 2 Bill Marketing 2018 hire_dates name 0 1998 Lisa 1 2016 Bobs 2 2007 Bill employee group hire_date hire_dates name 0 Bobs Accounting 1998 2016 Bobs 1 Bill Marketing 2018 2007 Bill
(2)内合并与外合并
内合并:只保留两者都有的key(默认模式)
df6 = DataFrame({'name':['Peter','Paul','Mary'], 'food':['fish','beans','bread']} ) df7 = DataFrame({'name':['Mary','Joseph'], 'drink':['wine','beer']}) display(df6,df7,pd.merge(df6,df7)) 输出: food name 0 fish Peter 1 beans Paul 2 bread Mary drink name 0 wine Mary 1 beer Joseph food name drink 0 bread Mary winee
外合并 how='outer':补NaN
df6 = DataFrame({'name':['Peter','Paul','Mary'], 'food':['fish','beans','bread']} ) df7 = DataFrame({'name':['Mary','Joseph'], 'drink':['wine','beer']}) display(df6,df7,pd.merge(df6,df7,how='outer')) 输出: food name 0 fish Peter 1 beans Paul 2 bread Mary drink name 0 wine Mary 1 beer Joseph food name drink 0 fish Peter NaN 1 beans Paul NaN 2 bread Mary wine 3 NaN Joseph beer
左合并、右合并:how='left',how='right'
df6 = DataFrame({'name':['Peter','Paul','Mary'], 'food':['fish','beans','bread']} ) df7 = DataFrame({'name':['Mary','Joseph'], 'drink':['wine','beer']}) display(df6,df7,pd.merge(df6,df7,how='left')) display(df6,df7,pd.merge(df6,df7,how='right')) 输出: food name 0 fish Peter 1 beans Paul 2 bread Mary drink name 0 wine Mary 1 beer Joseph food name drink 0 fish Peter NaN 1 beans Paul NaN 2 bread Mary wine food name drink 0 bread Mary wine 1 NaN Joseph beer
(3)列冲突的解决
当列冲突时,即有多个列名称相同时,需要使用on=来指定哪一个列作为key
df8 = DataFrame({'name':['Peter','Paul','Mary'],'rank':[1,2,3]}) df9 = DataFrame({'name':['Peter','Paul','Mary'],'rank':[5,6,7]}) display(df8,df9,pd.merge(df8,df9,on = 'name',suffixes=['rank_L','rank_R'])) 输出: name rank 0 Peter 1 1 Paul 2 2 Mary 3 name rank 0 Peter 5 1 Paul 6 2 Mary 7 name rankrank_L rankrank_R 0 Peter 1 5 1 Paul 2 6 2 Mary 3 7
回顾:Series/DataFrame运算与ndarray运算的区别
ndarray有广播,通过重复已有值来计算
a = np.array([1,2,3]) b = np.array([2]) display(a,b,a+b) 输出: array([1, 2, 3]) array([2]) array([3, 4, 5])
Series与DataFrame没有广播,如果对应index没有值,则记为NaN;或者使用add的fill_value来补缺失值