pandas描述性统计
数值型数据的描述性统计主要包括了计算数值型数据的完整情况、最小值、均值、中位 数、最大值、四分位数、极差、标准差、方差、协方差等。在NumPy库中一些常用的统计学函数也可用于对数据框进行描述性统计。
np.min 最小值 np.max 最大值 np.mean 均值 np.ptp 极差 np.median 中位数 np.std 标准差 np.var 方差 np.cov 协方差
实例:
import pandas as pdimport numpy as np# 创建DFd = { 'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack', 'Lee','David','Gasper','Betina','Andres']), 'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]), 'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}df = pd.DataFrame(d)print(df)""" Name Age Rating0 Tom 25 4.231 James 26 3.242 Ricky 25 3.983 Vin 23 2.564 Steve 30 3.205 Minsu 29 4.606 Jack 23 3.807 Lee 34 3.788 David 40 2.989 Gasper 30 4.8010 Betina 51 4.1011 Andres 46 3.65"""# 测试描述性统计函数print(df.sum())#(每列)求和"""Name TomJamesRickyVinSteveMinsuJackLeeDavidGasperBe...Age 382Rating 44.92dtype: object"""print(df.sum(1))#1是轴向 按行求和"""0 29.231 29.242 28.983 25.564 33.205 33.606 26.807 37.788 42.989 34.8010 55.1011 49.65dtype: float64"""print(df.mean())#按列求均值"""Age 31.833333Rating 3.743333dtype: float64"""print(df.mean(1))#年龄和评分的均值"""0 14.6151 14.6202 14.4903 12.7804 16.6005 16.8006 13.4007 18.8908 21.4909 17.40010 27.55011 24.825dtype: float64"""
pandas提供了统计相关函数:
count() | 非空观测数量 | |
---|---|---|
2 | sum() | 所有值之和 |
3 | mean() | 所有值的平均值 |
4 | median() | 所有值的中位数 |
5 | std() | 值的标准偏差 |
6 | min() | 所有值中的最小值 |
7 | max() | 所有值中的最大值 |
8 | abs() | 绝对值 |
9 | prod() | 数组元素的乘积 |
10 | cumsum() | 累计总和 |
pandas还提供了一个方法叫作describe,能够一次性得出数据框所有数值型特征的非空值数目、均值、标准差等。
import pandas as pd#Create a Dictionary of seriesd = { 'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack', 'Lee','David','Gasper','Betina','Andres']), 'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]), 'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}#Create a DataFramedf = pd.DataFrame(d)print(df.describe())""" Age Ratingcount 12.000000 12.000000mean 31.833333 3.743333std 9.232682 0.661628min 23.000000 2.56000025% 25.000000 3.23000050% 29.500000 3.79000075% 35.500000 4.132500max 51.000000 4.800000"""print(df.describe(include=['object']))""" Namecount 12unique 12top Gasperfreq 1"""print(df.describe(include=['number']))""" Age Ratingcount 12.000000 12.000000mean 31.833333 3.743333std 9.232682 0.661628min 23.000000 2.56000025% 25.000000 3.23000050% 29.500000 3.79000075% 35.500000 4.132500max 51.000000 4.800000"""
pandas排序
Pandas有两种排序方式,它们分别是按标签与按实际值排序。
import pandas as pdimport numpy as npunsorted_df=pd.DataFrame(np.random.randn(10,2), #十行 两列 随机生成数 index=[1,4,6,2,3,5,9,8,0,7], columns=['col2','col1'])print(unsorted_df)""" col2 col11 2.415183 -1.1324304 -0.572576 0.1351196 1.503749 -0.6664072 -1.292311 -0.9596263 -0.689003 0.1460445 0.139045 -1.2217949 -0.944587 -0.6435558 0.848157 0.4462150 -0.332118 1.2151897 0.061888 0.265198"""
按行标签排序
使用sort_index()
方法,通过传递axis
参数和排序顺序,可以对DataFrame
进行排序。 默认情况下,按照升序对行标签进行排序。
import pandas as pdd = { 'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack', 'Lee','David','Gasper','Betina','Andres']), 'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]), 'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}unsorted_df = pd.DataFrame(d)# 按照行标进行排序sorted_df=unsorted_df.sort_index()print (sorted_df)""" Name Age Rating0 Tom 25 4.231 James 26 3.242 Ricky 25 3.983 Vin 23 2.564 Steve 30 3.205 Minsu 29 4.606 Jack 23 3.807 Lee 34 3.788 David 40 2.989 Gasper 30 4.8010 Betina 51 4.1011 Andres 46 3.65"""# 按照行索引标签进行降序排列sorted_df = unsorted_df.sort_index(ascending=False)print (sorted_df)""" Name Age Rating11 Andres 46 3.6510 Betina 51 4.109 Gasper 30 4.808 David 40 2.987 Lee 34 3.786 Jack 23 3.805 Minsu 29 4.604 Steve 30 3.203 Vin 23 2.562 Ricky 25 3.981 James 26 3.240 Tom 25 4.23"""sorted_df = unsorted_df.sort_index(ascending=True,axis=3)print(sorted_df)""" Age Name Rating0 25 Tom 4.231 26 James 3.242 25 Ricky 3.983 23 Vin 2.564 30 Steve 3.205 29 Minsu 4.606 23 Jack 3.807 34 Lee 3.788 40 David 2.989 30 Gasper 4.8010 51 Betina 4.1011 46 Andres 3.65"""
按某列值排序
像索引排序一样,sort_values()
是按值排序的方法。它接受一个by
参数,它将使用要与其排序值的DataFrame
的列名称。
import pandas as pdd = { 'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Minsu', 'Jack', 'Lee', 'David', 'Gasper', 'Betina', 'Andres']), 'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]), 'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}unsorted_df = pd.DataFrame(d)print(unsorted_df)""" Name Age Rating0 Tom 25 4.231 James 26 3.242 Ricky 25 3.983 Vin 23 2.564 Steve 30 3.205 Minsu 29 4.606 Jack 23 3.807 Lee 34 3.788 David 40 2.989 Gasper 30 4.8010 Betina 51 4.1011 Andres 46 3.65"""# 按照列标签进行排序sorted_df = unsorted_df.sort_index(axis=1)print(sorted_df)"""Age Name Rating0 25 Tom 4.231 26 James 3.242 25 Ricky 3.983 23 Vin 2.564 30 Steve 3.205 29 Minsu 4.606 23 Jack 3.807 34 Lee 3.788 40 David 2.989 30 Gasper 4.8010 51 Betina 4.1011 46 Andres 3.65"""# 按照年龄排序print(sorted_df.sort_values('Age'))""" Age Name Rating3 23 Vin 2.566 23 Jack 3.800 25 Tom 4.232 25 Ricky 3.981 26 James 3.245 29 Minsu 4.604 30 Steve 3.209 30 Gasper 4.807 34 Lee 3.788 40 David 2.9811 46 Andres 3.6510 51 Betina 4.10"""# 联合间接排序print(sorted_df.sort_values(['Age', 'Rating']))""" Age Name Rating3 23 Vin 2.566 23 Jack 3.802 25 Ricky 3.980 25 Tom 4.231 26 James 3.245 29 Minsu 4.604 30 Steve 3.209 30 Gasper 4.807 34 Lee 3.788 40 David 2.9811 46 Andres 3.6510 51 Betina 4.10"""# 控制排序顺序print(sorted_df.sort_values(['Age', 'Rating'], ascending=[True, False]))""" Age Name Rating6 23 Jack 3.803 23 Vin 2.560 25 Tom 4.232 25 Ricky 3.981 26 James 3.245 29 Minsu 4.609 30 Gasper 4.804 30 Steve 3.207 34 Lee 3.788 40 David 2.9811 46 Andres 3.6510 51 Betina 4.10"""
pandas分组
在许多情况下,我们将数据分成多个集合,并在每个子集上应用一些函数。在应用函数中,可以执行以下操作 :
-
聚合 - 计算汇总统计
-
转换 - 执行一些特定于组的操作
-
过滤 - 在某些情况下丢弃数据
import pandas as pdipl_data = { 'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df = pd.DataFrame(ipl_data)print(df)
# 按照年份Year字段分组print (df.groupby('Year'))# 查看分组结果print (df.groupby('Year').groups)
迭代遍历分组
groupby返回可迭代对象,可以使用for循环遍历:
print (df.groupby('Year').groups)# 遍历每个分组for year,group in grouped: print (year) print (group)
grouped = df.groupby('Year')print (grouped.get_group(2014))
import pandas as pdipl_data = { 'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2], 'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017], 'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}df = pd.DataFrame(ipl_data)print(df)""" Team Rank Year Points0 Riders 1 2014 8761 Riders 2 2015 7892 Devils 2 2014 8633 Devils 3 2015 6734 Kings 3 2014 7415 kings 4 2015 8126 Kings 1 2016 7567 Kings 1 2017 7888 Riders 2 2016 6949 Royals 4 2014 70110 Royals 1 2015 80411 Riders 2 2017 690"""#按评分排序print(df.sort_values('Points',ascending=False))"""Team Rank Year Points0 Riders 1 2014 8762 Devils 2 2014 8635 kings 4 2015 81210 Royals 1 2015 8041 Riders 2 2015 7897 Kings 1 2017 7886 Kings 1 2016 7564 Kings 3 2014 7419 Royals 4 2014 7018 Riders 2 2016 69411 Riders 2 2017 6903 Devils 3 2015 673"""# 按照年份Year字段分组,查看每个分组的信息grouped = df.groupby('Year')print(grouped)#分组对象# 查看分组结果print(grouped.groups)"""{2014: Int64Index([0, 2, 4, 9], dtype='int64'), 2015: Int64Index([1, 3, 5, 10], dtype='int64'), 2016: Int64Index([6, 8], dtype='int64'), 2017: Int64Index([7, 11], dtype='int64')}"""#遍历查看每个分组的信息for year, group in grouped: print(year) print(group)"""2014 Team Rank Year Points0 Riders 1 2014 8762 Devils 2 2014 8634 Kings 3 2014 7419 Royals 4 2014 7012015 Team Rank Year Points1 Riders 2 2015 7893 Devils 3 2015 6735 kings 4 2015 81210 Royals 1 2015 8042016 Team Rank Year Points6 Kings 1 2016 7568 Riders 2 2016 6942017 Team Rank Year Points7 Kings 1 2017 78811 Riders 2 2017 690"""#若不希望获取所有分组,如下获取某个分组细节:group = grouped.get_group(2014)print(group)""" Team Rank Year Points0 Riders 1 2014 8762 Devils 2 2014 8634 Kings 3 2014 7419 Royals 4 2014 701"""
分组聚合
聚合函数为每个组返回聚合值。当创建了分组(group by)对象,就可以对每个分组数据执行求和、求标准差等操作。
# 聚合每一年的平均的分grouped = df.groupby('Year')print (grouped['Points'].agg(np.mean))# 聚合每一年的分数之和、平均分、标准差grouped = df.groupby('Year')agg = grouped['Points'].agg([np.sum, np.mean, np.std])print (agg)
import pandas as pdimport numpy as npipl_data = { 'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2], 'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017], 'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}df = pd.DataFrame(ipl_data)# print(df)# 按照年份Year字段分组,查看每个分组的信息grouped = df.groupby('Year')#分组后针对每一组执行聚合操作,(类似数据库中的组函数)r = grouped['Points'].agg(np.mean)print(r,type(r),r.values)"""Year2014 795.252015 769.502016 725.002017 739.00Name: Points, dtype: float64[795.25 769.5 725. 739. ]""" 平均值,和,标准差 r = grouped['Points'].agg([np.mean,np.sum,np.std])print(r)""" mean sum stdYear 2014 795.25 3181 87.4390262015 769.50 3078 65.0358882016 725.00 1450 43.8406202017 739.00 1478 69.296465"""
pandas数据表关联操作
Pandas具有功能全面的高性能内存中连接操作,与SQL等关系数据库非常相似。 Pandas提供了一个单独的merge()
函数,作为DataFrame对象之间所有标准数据库连接操作的入口。
合并两个DataFrame:
import pandas as pdleft = pd.DataFrame({ 'student_id':[1,2,3,4,5,6,7,8,9,10], 'student_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung', 'Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'class_id':[1,1,1,2,2,2,3,3,3,4]})right = pd.DataFrame( { 'class_id':[1,2,3,5], 'class_name': ['ClassA', 'ClassB', 'ClassC', 'ClassE']})print (left)print("========================================")print (right)print("========================================")# 合并两个DataFramers = pd.merge(left,right)print(rs)""" student_id student_name class_id class_name0 1 Alex 1 ClassA1 2 Amy 1 ClassA2 3 Allen 1 ClassA3 4 Alice 2 ClassB4 5 Ayoung 2 ClassB5 6 Billy 2 ClassB6 7 Brian 3 ClassC7 8 Bran 3 ClassC8 9 Bryce 3 ClassC"""rs = pd.merge(left,right,how='outer')#内连接print(rs)""" student_id student_name class_id class_name0 1.0 Alex 1 ClassA1 2.0 Amy 1 ClassA2 3.0 Allen 1 ClassA3 4.0 Alice 2 ClassB4 5.0 Ayoung 2 ClassB5 6.0 Billy 2 ClassB6 7.0 Brian 3 ClassC7 8.0 Bran 3 ClassC8 9.0 Bryce 3 ClassC9 10.0 Betty 4 NaN10 NaN NaN 5 ClassE"""rs = pd.merge(left,right,how='left')#左外链接print(rs)""" student_id student_name class_id class_name0 1 Alex 1 ClassA1 2 Amy 1 ClassA2 3 Allen 1 ClassA3 4 Alice 2 ClassB4 5 Ayoung 2 ClassB5 6 Billy 2 ClassB6 7 Brian 3 ClassC7 8 Bran 3 ClassC8 9 Bryce 3 ClassC9 10 Betty 4 NaN"""rs = pd.merge(left,right,how='right')#右外链接print(rs)""" student_id student_name class_id class_name0 1.0 Alex 1 ClassA1 2.0 Amy 1 ClassA2 3.0 Allen 1 ClassA3 4.0 Alice 2 ClassB4 5.0 Ayoung 2 ClassB5 6.0 Billy 2 ClassB6 7.0 Brian 3 ClassC7 8.0 Bran 3 ClassC8 9.0 Bryce 3 ClassC9 NaN NaN 5 ClassE"""
其他合并方法同数据库相同:
SQL等效 | 描述 | |
---|---|---|
left | LEFT OUTER JOIN | 使用左侧对象的键 |
right | RIGHT OUTER JOIN | 使用右侧对象的键 |
outer | FULL OUTER JOIN | 使用键的联合 |
inner | INNER JOIN | 使用键的交集 |
试验:
# 合并两个DataFrame (左连接)rs = pd.merge(left,right,on='subject_id', how='right')print(rs)# 合并两个DataFrame (左连接)rs = pd.merge(left,right,on='subject_id', how='outer')print(rs)# 合并两个DataFrame (左连接)rs = pd.merge(left,right,on='subject_id', how='inner')print(rs)
pandas透视表与交叉表
有如下数据:
"""表连接"""import pandas as pdleft = pd.DataFrame({ 'student_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'student_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung', 'Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'class_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4]})right = pd.DataFrame( { 'class_id': [1, 2, 3, 5], 'class_name': ['ClassA', 'ClassB', 'ClassC', 'ClassE']})# 合并两个DataFramedata = pd.merge(left, right)print(data)""" student_id student_name class_id class_name0 1 Alex 1 ClassA1 2 Amy 1 ClassA2 3 Allen 1 ClassA3 4 Alice 2 ClassB4 5 Ayoung 2 ClassB5 6 Billy 2 ClassB6 7 Brian 3 ClassC7 8 Bran 3 ClassC8 9 Bryce 3 ClassC"""
透视表
透视表(pivot table)是各种电子表格程序和其他数据分析软件中一种常见的数据汇总工具。它根据一个或多个键对数据进行分组聚合,并根据每个分组进行数据汇总。
"""透视表"""import pandas as pdleft = pd.DataFrame({ 'student_id':[1,2,3,4,5,6,7,8,9,10], 'student_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung', 'Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'age':[11,11,11,21,21,21,31,31,31,41], 'gender':['M','F','M','M','M','F','F','M','F','F'], 'score':[16,19,14,23,27,2,39,79,56,99], 'class_id':[1,1,1,2,2,2,3,3,3,4]})right = pd.DataFrame( { 'class_id':[1,2,3,5], 'class_name': ['ClassA', 'ClassB', 'ClassC', 'ClassE']})# 合并两个DataFramedata = pd.merge(left,right)# print(data)# 以class_id做分组汇总数据,默认聚合统计所有列r = data.pivot_table(index=['class_id'])print(r)#统计每班的平均年龄""" age score student_idclass_id 1 11 16.333333 22 21 17.333333 53 31 58.000000 8"""# 以class_id做分组汇总数据,默认聚合统计所有列r = data.pivot_table(index=['class_id','gender'])print(r)""" age score student_idclass_id gender 1 F 11 19.0 2.0 M 11 15.0 2.02 F 21 2.0 6.0 M 21 25.0 4.53 F 31 47.5 8.0 M 31 79.0 8.0"""# 以class_id与gender做分组汇总数据,聚合统计score列print(data.pivot_table(index=['class_id', 'gender'], values=['score']))""" scoreclass_id gender 1 F 19.0 M 15.02 F 2.0 M 25.03 F 47.5 M 79.0"""# 以class_id与gender做分组汇总数据,聚合统计score列,针对age的每个值列级分组统计print(data.pivot_table(index=['class_id', 'gender'], values=['score'], columns=['age']))""" score age 11 21 31class_id gender 1 F 19.0 NaN NaN M 15.0 NaN NaN2 F NaN 2.0 NaN M NaN 25.0 NaN3 F NaN NaN 47.5 M NaN NaN 79.0"""# 以class_id与gender做分组汇总数据,聚合统计score列,针对age的每个值列级分组统计,添加行、列小计# print(data.pivot_table(index=['class_id', 'gender'],# values=['score'],# columns=['age'],# margin=True))# 以class_id与gender做分组汇总数据,聚合统计score列,针对age的每个值列级分组统计,添加行、列小计# print(data.pivot_table(index=['class_id', 'gender'],# values=['score'],# columns=['age'],# margins=True,# aggfunc='max'))
交叉表(cross-tabulation, 简称crosstab)是一种用于计算分组频率的特殊透视表
# 按照class_id分组,针对不同的gender,统计数量print(pd.crosstab(data.class_id, data.gender, margins=True))
pandas可视化
基本绘图:绘图
import pandas as pdimport numpy as npimport matplotlib.pyplot as mp df = pd.DataFrame(np.random.randn(10,4),index=pd.date_range('2018/12/18', periods=10), columns=list('ABCD'))df.plot()mp.show()
plot方法允许除默认线图之外的少数绘图样式。 这些方法可以作为plot()
的kind
关键字参数。这些包括 :
-
bar
或barh
为条形 -
hist
为直方图 -
scatter
为散点图
条形图
df = pd.DataFrame(np.random.rand(10,4),columns=['a','b','c','d'])df.plot.bar()# df.plot.bar(stacked=True)mp.show()
直方图
df = pd.DataFrame()df['a'] = pd.Series(np.random.normal(0, 1, 1000)-1)df['b'] = pd.Series(np.random.normal(0, 1, 1000))df['c'] = pd.Series(np.random.normal(0, 1, 1000)+1)print(df)df.plot.hist(bins=20)mp.show()
散点图
df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])df.plot.scatter(x='a', y='b')mp.show()
饼状图
df = pd.DataFrame(3 * np.random.rand(4), index=['a', 'b', 'c', 'd'], columns=['x'])df.plot.pie(subplots=True)mp.show()