在看numpy和pandas的时候,发现了一个有趣的现象。由于std函数的默认参数不同所以求出来的standard deviation 是不同的。
看代码
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df5=pd.DataFrame(np.arange(9).reshape(3,3), columns=['a','b','c'])
In [4]: df5
Out[4]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
In [5]: df5.apply(np.std, axis=1)
Out[5]:
0 0.816497
1 0.816497
2 0.816497
dtype: float64
In [6]: df5.std(axis=1)
Out[6]:
0 1.0
1 1.0
2 1.0
dtype: float64
为什么使用np.std 和pandas的std函数,得出的结果却不一样呢?
help一下。在numpy中ddof默认值是0.
std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>)
Compute the standard deviation along the specified axis.
Returns the standard deviation, a measure of the spread of a distribution,
of the array elements. The standard deviation is computed for the
flattened array by default, otherwise over the specified axis.
Parameters
----------
a : array_like
Calculate the standard deviation of these values.
axis : None or int or tuple of ints, optional
Axis or axes along which the standard deviation is computed. The
default is to compute the standard deviation of the flattened array.
.. versionadded:: 1.7.0
If this is a tuple of ints, a standard deviation is performed over
multiple axes, instead of a single axis or all the axes as before.
dtype : dtype, optional
Type to use in computing the standard deviation. For arrays of
integer type the default is float64, for arrays of float types it is
the same as the array type.
out : ndarray, optional
Alternative output array in which to place the result. It must have
the same shape as the expected output but the type (of the calculated
values) will be cast if necessary.
ddof : int, optional
Means Delta Degrees of Freedom. The divisor used in calculations
is ``N - ddof``, where ``N`` represents the number of elements.
By default `ddof` is zero.
再看pandas中的help信息。ddof的默认值是1
Help on method std in module pandas.core.frame:
std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs) method of pandas.core.frame.DataFrame instance
Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
Parameters
----------
axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result
will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a
particular level, collapsing into a Series
ddof : int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof,
where N represents the number of elements.
ddof: Delta Degrees of Freedom.
Degrees of Freedom 是自由度,Delta Degrees of Freedom是什么我是搞不清楚.
但是找到了原因,在调用np.std传入ddof的值就可以了。
In [12]: df5.apply(np.std, axis=1, ddof=1)
Out[12]:
0 1.0
1 1.0
2 1.0
dtype: float64
In [13]: df5.std(axis=1, ddof=1)
Out[13]:
0 1.0
1 1.0
2 1.0
dtype: float64
PREVIOUSShell里面打印utf 8字符