In [10]: df2 Out[10]: A B C D E F 01.02013-01-02 1.03 test foo 11.02013-01-02 1.03 train foo 21.02013-01-02 1.03 test foo 31.02013-01-02 1.03 train foo
(4)DataFrame 的列有不同的数据类型
1 2 3 4 5 6 7 8 9
In [11]: df2.dtypes Out[11]: A float64 B datetime64[ns] C float32 D int32 E category F object dtype: object
2、查看数据
(1)查看 DataFrame 头尾数据,默认长度为 5
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
In [13]: df.head() Out[13]: A B C D 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 2013-01-02 1.212112 -0.1732150.119209 -1.044236 2013-01-03 -0.861849 -2.104569 -0.4949291.071804 2013-01-04 0.721555 -0.706771 -1.0395750.271860 2013-01-05 -0.4249720.5670200.276232 -1.087401
In [14]: df.tail(3) Out[14]: A B C D 2013-01-04 0.721555 -0.706771 -1.0395750.271860 2013-01-05 -0.4249720.5670200.276232 -1.087401 2013-01-06 -0.6736900.113648 -1.4784270.524988
In [24]: df[0:3] Out[24]: A B C D 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 2013-01-02 1.212112 -0.1732150.119209 -1.044236 2013-01-03 -0.861849 -2.104569 -0.4949291.071804
In [25]: df['20130102':'20130104'] Out[25]: A B C D 2013-01-02 1.212112 -0.1732150.119209 -1.044236 2013-01-03 -0.861849 -2.104569 -0.4949291.071804 2013-01-04 0.721555 -0.706771 -1.0395750.271860
3.2、按标签选择(loc)
(1)用标签提取一行数据
1 2 3 4 5 6 7
In [26]: df.loc[dates[0]] Out[26]: A 0.469112 B -0.282863 C -1.509059 D -1.135632 Name: 2013-01-01 00:00:00, dtype: float64
(2)用标签提取多列数据
1 2 3 4 5 6 7 8 9
In [27]: df.loc[:, ['A', 'B']] Out[27]: A B 2013-01-01 0.469112 -0.282863 2013-01-02 1.212112 -0.173215 2013-01-03 -0.861849 -2.104569 2013-01-04 0.721555 -0.706771 2013-01-05 -0.4249720.567020 2013-01-06 -0.6736900.113648
(3)用标签切片,左闭右闭
1 2 3 4 5 6
In [28]: df.loc['20130102':'20130104', ['A' : 'B']] Out[28]: A B 2013-01-02 1.212112 -0.173215 2013-01-03 -0.861849 -2.104569 2013-01-04 0.721555 -0.706771
(4)返回对象降维(有一维使用列名直接选择)
1 2 3 4 5
In [29]: df.loc['20130102', ['A', 'B']] Out[29]: A 1.212112 B -0.173215 Name: 2013-01-02 00:00:00, dtype: float64
(5)提取标量值(全部维度都是用列名直接选择)
1 2
In [30]: df.loc[dates[0], 'A'] Out[30]: 0.46911229990718628
(6)快速访问标量值
1 2
In [31]: df.at[dates[0], 'A'] Out[31]: 0.46911229990718628
性能对比:相差25%,还可以接受
1 2 3 4 5
In [31]: %timeit df.loc[dates[0], 'A'] %timeit df.at[dates[0], 'A'] Out[31]: 32.3 µs ± 764 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 24.7 µs ± 580 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
3.3、按位置选择(iloc)【重点】
(1)用整数位置选择
1 2 3 4 5 6 7
In [32]: df.iloc[3] Out[32]: A 0.721555 B -0.706771 C -1.039575 D 0.271860 Name: 2013-01-04 00:00:00, dtype: float64
(2)用整数切片,左闭右开
1 2 3 4 5
In [33]: df.iloc[3:5, 0:2] Out[33]: A B 2013-01-04 0.721555 -0.706771 2013-01-05 -0.4249720.567020
(3)用整数列表按位置切片
1 2 3 4 5 6
In [34]: df.iloc[[1, 2, 4], [0, 2]] Out[34]: A C 2013-01-02 1.2121120.119209 2013-01-03 -0.861849 -0.494929 2013-01-05 -0.4249720.276232
(4) 显式整行切片
1 2 3 4 5 6
In [34]: df.iloc[[1, 2, 4], [0, 2]] Out[34]: A C 2013-01-02 1.2121120.119209 2013-01-03 -0.861849 -0.494929 2013-01-05 -0.4249720.276232
(5)显式整列切片
1 2 3 4 5 6 7 8 9
In [36]: df.iloc[:, 1:3] Out[36]: B C 2013-01-01 -0.282863 -1.509059 2013-01-02 -0.1732150.119209 2013-01-03 -2.104569 -0.494929 2013-01-04 -0.706771 -1.039575 2013-01-05 0.5670200.276232 2013-01-06 0.113648 -1.478427
(6)显式提取值
1 2
In [37]: df.iloc[1, 1] Out[37]: -0.17321464905330858
(7)快速访问标量
1 2
In [38]: df.iat[1, 1] Out[38]: -0.17321464905330858
性能对比:相差无几,10%左右。
1 2 3 4 5
In [39]: %timeit df.iloc[1, 1] %timeit df.iat[1, 1] Out[39]: 33.1 µs ± 578 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 29 µs ± 911 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
3.4、布尔索引
(1)用单列的值选择数据
1 2 3 4 5 6
In [39]: df[df.A > 0] Out[39]: A B C D 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 2013-01-02 1.212112 -0.1732150.119209 -1.044236 2013-01-04 0.721555 -0.706771 -1.0395750.271860
(2)选择 DataFrame 里满足条件的值
1 2 3 4 5 6 7 8 9
In [40]: df[df > 0] Out[40]: A B C D 2013-01-01 0.469112 NaN NaN NaN 2013-01-02 1.212112 NaN 0.119209 NaN 2013-01-03 NaN NaN NaN 1.071804 2013-01-04 0.721555 NaN NaN 0.271860 2013-01-05 NaN 0.5670200.276232 NaN 2013-01-06 NaN 0.113648 NaN 0.524988
(3)使用 isin() 筛选
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
In [41]: df2 = df.copy()
In [42]: df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
In [43]: df2 Out[43]: A B C D E 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 one 2013-01-02 1.212112 -0.1732150.119209 -1.044236 one 2013-01-03 -0.861849 -2.104569 -0.4949291.071804 two 2013-01-04 0.721555 -0.706771 -1.0395750.271860 three 2013-01-05 -0.4249720.5670200.276232 -1.087401 four 2013-01-06 -0.6736900.113648 -1.4784270.524988 three
In [44]: df2[df2['E'].isin(['two', 'four'])] Out[44]: A B C D E 2013-01-03 -0.861849 -2.104569 -0.4949291.071804 two 2013-01-05 -0.4249720.5670200.276232 -1.087401 four
3.5、赋值
df
1 2 3 4 5 6 7
A B C D 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 2013-01-02 1.212112 -0.1732150.119209 -1.044236 2013-01-03 -0.861849 -2.104569 -0.4949291.071804 2013-01-04 0.721555 -0.706771 -1.0395750.271860 2013-01-05 -0.4249720.5670200.276232 -1.087401 2013-01-06 -0.6736900.113648 -1.4784270.524988
In [51]: df Out[51]: A B C D F 2013-01-01 0.0000000.000000 -1.5090595 NaN 2013-01-02 1.212112 -0.1732150.11920951.0 2013-01-03 -0.861849 -2.104569 -0.49492952.0 2013-01-04 0.721555 -0.706771 -1.03957553.0 2013-01-05 -0.4249720.5670200.27623254.0
(5)使用 where 条件赋值
1 2 3 4 5 6 7 8 9 10 11 12 13
In [52]: df2 = df.copy()
In [53]: df2[df2 > 0] = -df2
In [54]: df2 Out[54]: A B C D F 2013-01-01 0.0000000.000000 -1.509059 -5 NaN 2013-01-02 -1.212112 -0.173215 -0.119209 -5 -1.0 2013-01-03 -0.861849 -2.104569 -0.494929 -5 -2.0 2013-01-04 -0.721555 -0.706771 -1.039575 -5 -3.0 2013-01-05 -0.424972 -0.567020 -0.276232 -5 -4.0 2013-01-06 -0.673690 -0.113648 -1.478427 -5 -5.0
4、缺失值
Pandas 主要用 np.nan 表示缺失数据。 计算时,默认不包含空值。
(1)重建索引(reindex)
1 2 3 4 5 6 7 8 9 10 11
In [55]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
In [56]: df1.loc[dates[0]:dates[1], 'E'] = 1
In [57]: df1 Out[57]: A B C D F E 2013-01-01 0.0000000.000000 -1.5090595 NaN 1.0 2013-01-02 1.212112 -0.1732150.11920951.01.0 2013-01-03 -0.861849 -2.104569 -0.49492952.0 NaN 2013-01-04 0.721555 -0.706771 -1.03957553.0 NaN
(2)删除所有含缺失值的行
1 2 3 4
In [58]: df1.dropna(how='any') Out[58]: A B C D F E 2013-01-02 1.212112 -0.1732150.11920951.01.0
(3)填充缺失值
1 2 3 4 5 6 7
In [59]: df1.fillna(value=5) Out[59]: A B C D F E 2013-01-01 0.0000000.000000 -1.50905955.01.0 2013-01-02 1.212112 -0.1732150.11920951.01.0 2013-01-03 -0.861849 -2.104569 -0.49492952.05.0 2013-01-04 0.721555 -0.706771 -1.03957553.05.0
(4)获取 nan 值的布尔掩码
1 2 3 4 5 6 7
In [60]: pd.isna(df1) Out[60]: A B C D F E 2013-01-01 FalseFalseFalseFalseTrueFalse 2013-01-02 FalseFalseFalseFalseFalseFalse 2013-01-03 FalseFalseFalseFalseFalseTrue 2013-01-04 FalseFalseFalseFalseFalseTrue
5、运算
5.1、统计
测试数据
1 2 3 4 5 6 7
A B C D F 2013-01-01 0.0000000.000000 -0.0516855 NaN 2013-01-02 -1.0095920.1416431.24947551.0 2013-01-03 1.3474920.3663890.20202752.0 2013-01-04 -2.9733151.4557530.25798653.0 2013-01-05 0.7928360.896040 -0.44378854.0 2013-01-06 -1.4521240.4658760.69822055.0
(1)一般情况下,运算时排除缺失值
1 2 3 4 5 6 7 8
In [61]: df.mean() Out[61]: A -0.004474 B -0.383981 C -0.687758 D 5.000000 F 3.000000 dtype: float64
In [63]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
In [64]: s Out[64]: 2013-01-01 NaN 2013-01-02 NaN 2013-01-03 1.0 2013-01-04 3.0 2013-01-05 5.0 2013-01-06 NaN Freq: D, dtype: float64
In [65]: df.sub(s, axis='index') Out[65]: A B C D F 2013-01-01 NaN NaN NaN NaN NaN 2013-01-02 NaN NaN NaN NaN NaN 2013-01-03 -1.861849 -3.104569 -1.4949294.01.0 2013-01-04 -2.278445 -3.706771 -4.0395752.00.0 2013-01-05 -5.424972 -4.432980 -4.7237680.0 -1.0 2013-01-06 NaN NaN NaN NaN NaN
5.2、Apply 映射函数
可以对所有行/列进行统一映射操作,默认为列
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
In [66]: df.apply(np.cumsum) Out[66]: A B C D F 2013-01-01 0.0000000.000000 -1.5090595 NaN 2013-01-02 1.212112 -0.173215 -1.389850101.0 2013-01-03 0.350263 -2.277784 -1.884779153.0 2013-01-04 1.071818 -2.984555 -2.924354206.0 2013-01-05 0.646846 -2.417535 -2.6481222510.0 2013-01-06 -0.026844 -2.303886 -4.1265493015.0
In [67]: df.apply(lambda x: x.max() - x.min()) Out[67]: A 2.073961 B 2.671590 C 1.785291 D 0.000000 F 4.000000 dtype: float64
In [92]: df Out[92]: A B C D 0 foo one -1.202872 -0.055224 1 bar one -1.8144702.395985 2 foo two 1.0186011.552825 3 bar three -0.5954470.166599 4 foo two 1.3954330.047609 5 bar two -0.392670 -0.136473 6 foo one 0.007207 -0.561757 7 foo three 1.928123 -1.623033
(1)先分组,再用 sum() 算出每组的汇总数据
1 2 3 4 5 6
In [93]: df.groupby('A').sum() Out[93]: C D A bar -2.8025882.42611 foo 3.146492 -0.63958
(2)多层分组后,生成多层索引,也可以应用 sum 函数
1 2 3 4 5 6 7 8 9 10
In [94]: df.groupby(['A', 'B']).sum() Out[94]: C D A B bar one -1.8144702.395985 three -0.5954470.166599 two -0.392670 -0.136473 foo one -1.195665 -0.616981 three 1.928123 -1.623033 two 2.4140341.600434
In [96]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [97]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
In [98]: df2 = df[:4]
In [99]: df2 Out[99]: A B first second bar one 0.029399 -0.542108 two 0.282696 -0.087302 baz one -1.5751701.771208 two 0.8164821.100230
多层索引选择数据
1 2 3 4 5 6 7
In [96]: df.loc['bar', 'one'] ....: df Out[96]: A -0.204011 B -1.956557 Name: (bar, one), dtype: float64
8.2、堆叠(Stack)
stack() 方法将 DataFrame 压缩至一层,即 Series
1 2 3 4 5 6 7 8 9 10 11 12 13
In [100]: stacked = df2.stack()
In [101]: stacked Out[101]: first second B -0.542108 two A 0.282696 B -0.087302 baz one A -1.575170 B 1.771208 two A 0.816482 B 1.100230 dtype: float64
压缩后的 DataFrame 或 Series 具有多层索引,stack() 的逆操作是 unstack(),默认为拆叠最后一层。
In [102]: stacked.unstack() Out[102]: A B first second bar one 0.029399 -0.542108 two 0.282696 -0.087302 baz one -1.5751701.771208 two 0.8164821.100230
In [103]: stacked.unstack(1) Out[103]: second one two first bar A 0.0293990.282696 B -0.542108 -0.087302 baz A -1.5751700.816482 B 1.7712081.100230
In [104]: stacked.unstack(0) Out[104]: first bar baz second one A 0.029399 -1.575170 B -0.5421081.771208 two A 0.2826960.816482 B -0.0873021.100230
In [106]: df Out[106]: A B C D E 0 one A foo 1.418757 -0.179666 1 one B foo -1.8790241.291836 2 two C foo 0.536826 -0.009614 3 three A bar 1.0061600.392149 4 one B bar -0.0297160.264599 5 one C bar -1.146178 -0.057409 6 two A foo 0.100900 -1.425638 7 three B foo -1.0350181.024098 8 one C foo 0.314665 -0.106062 9 one A bar -0.7737231.824375 10 two B bar -1.1706530.595974 11 three C bar 0.6487401.167115
生成透视图
1 2 3 4 5 6 7 8 9 10 11 12 13
In [107]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C']) Out[107]: C bar foo A B one A -0.7737231.418757 B -0.029716 -1.879024 C -1.1461780.314665 three A 1.006160 NaN B NaN -1.035018 C 0.648740 NaN two A NaN 0.100900 B -1.170653 NaN C NaN 0.536826
In [132]: df["grade"] Out[132]: 0 very good 1 good 2 good 3 very good 4 very good 5 very bad Name: grade, dtype: category Categories (5, object): [very bad, bad, medium, good, very good]
类别排序时,是按照生成类别时的顺序排序,而不是按照字典序。
1 2 3 4 5 6 7 8 9
In [133]: df.sort_values(by="grade") Out[133]: id raw_grade grade 56 e very bad 12 b good 23 b good 01 a very good 34 a very good 45 a very good
按类列分组(groupby)时,即便某类别为空,也会显示。
1 2 3 4 5 6 7 8 9
In [134]: df.groupby("grade").size() Out[134]: grade very bad 1 bad 0 medium 0 good 2 very good 3 dtype: int64
12、可视化(Visualization)
matplotlib 可视化也就图一乐,诶,真要可视化还得看我 pandas。
(1)单特征
1 2 3 4 5 6 7 8
In [135]: ts = pd.Series(np.random.randn(1000), .....: index=pd.date_range('1/1/2000', periods=1000)) .....:
In [136]: ts = ts.cumsum()
In [137]: ts.plot() Out[137]: <matplotlib.axes._subplots.AxesSubplot at 0x7f2b5771ac88>