This example comes from the book Pandas for Everyone and uses the following CSV file.
In [80]: !cat ./gapminder/other_csv/scientists.csv
Name,Born,Died,Age,Occupation
Rosaline Franklin,1920-07-25,1958-04-16,37,Chemist
William Gosset,1876-06-13,1937-10-16,61,Statistician
Florence Nightingale,1820-05-12,1910-08-13,90,Nurse
Marie Curie,1867-11-07,1934-07-04,66,Chemist
Rachel Carson,1907-05-27,1964-04-14,56,Biologist
John Snow,1813-03-15,1858-06-16,45,Physician
Alan Turing,1912-06-23,1954-06-07,41,Computer Scientist
Johann Gauss,1777-04-30,1855-02-23,77,Mathematician
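Read the CSV file and assign one of its columns to ages, then shuffle that column in place. ages changes along with it, so ages is only a reference to the column's data, not an independent copy.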
In [81]: import numpy as np
In [82]: import pandas as pd
In [83]: import matplotlib.pyplot as plt
In [84]: scientists=pd.read_csv("./gapminder/other_csv/scientists.csv")
In [85]: scientists.shape
Out[85]: (8, 5)
In [86]: scientists.columns
Out[86]: Index(['Name', 'Born', 'Died', 'Age', 'Occupation'], dtype='object')
In [87]: scientists.dtypes
Out[87]:
Name object
Born object
Died object
Age int64
Occupation object
dtype: object
In [88]: ages=scientists['Age']
In [89]: ages
Out[89]:
0 37
1 61
2 90
3 66
4 56
5 45
6 41
7 77
Name: Age, dtype: int64
In [90]: import random
In [91]: random.seed(42)
In [92]: random.shuffle(scientists['Age'])
/usr/lib/python3.7/random.py:278: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
x[i], x[j] = x[j], x[i]
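The aliasing described above can be avoided by taking an explicit copy before shuffling. The following is a minimal sketch, not the book's code: it builds a one-column DataFrame from the ages in the CSV, and the names ages_snapshot and shuffled are illustrative. Whether a plain reference such as ages follows an in-place shuffle depends on the pandas version (the SettingWithCopyWarning above already hints that this is unreliable), so an explicit copy() is the safer choice.

```python
import pandas as pd

# Ages taken from the CSV above; only this column is needed for the sketch.
scientists = pd.DataFrame({'Age': [37, 61, 90, 66, 56, 45, 41, 77]})

# copy() gives an independent Series that later changes cannot touch.
ages_snapshot = scientists['Age'].copy()

# sample(frac=1) returns a new, shuffled Series instead of mutating in place;
# reset_index(drop=True) discards the old row labels so the values are
# assigned back by position rather than re-aligned to their original rows.
shuffled = scientists['Age'].sample(frac=1, random_state=42).reset_index(drop=True)
scientists['Age'] = shuffled

# The snapshot still holds the original order.
print(ages_snapshot.tolist())
```

Assigning the shuffled Series back with plain column assignment also avoids the SettingWithCopyWarning, because no element-wise write goes through a possibly temporary object.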
Next, create two new columns of type datetime64 from the Born and Died columns.
In [93]: born_datetime=pd.to_datetime(scientists['Born'], format="%Y-%m-%d")
In [94]: died_datetime=pd.to_datetime(scientists['Died'], format='%Y-%m-%d')
In [95]: scientists['Born_dt'], scientists['Died_dt']=(born_datetime, died_datetime)
In [96]: scientists.dtypes
Out[96]:
Name object
Born object
Died object
Age int64
Occupation object
Born_dt datetime64[ns]
Died_dt datetime64[ns]
dtype: object
In [97]: scientists.shape
Out[97]: (8, 7)
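As an alternative to converting after loading, the date columns can be parsed while reading the file. This is a sketch under assumptions: it inlines the first three rows of the CSV through io.StringIO instead of reading ./gapminder/other_csv/scientists.csv, and it overwrites Born and Died rather than adding Born_dt and Died_dt.

```python
import io
import pandas as pd

# First three rows of the CSV above, inlined so the sketch is self-contained.
csv_text = """Name,Born,Died,Age,Occupation
Rosaline Franklin,1920-07-25,1958-04-16,37,Chemist
William Gosset,1876-06-13,1937-10-16,61,Statistician
Florence Nightingale,1820-05-12,1910-08-13,90,Nurse
"""

# parse_dates asks read_csv to convert the listed columns to datetime64.
scientists = pd.read_csv(io.StringIO(csv_text), parse_dates=['Born', 'Died'])

print(scientists.dtypes)
print(scientists['Born'].dt.year.tolist())
```

Keeping the original string columns, as the session above does, is useful when the raw text is still needed; parsing at read time is shorter when it is not.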
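Finally, subtract the two datetime64 columns to get a timedelta64 column, then convert it to int.
scientists['age_days_dt']=scientists['Died_dt'] - scientists['Born_dt']
# Either of the following two approaches works
scientists['age_years_dt']=scientists['age_days_dt'].astype(pd.Timedelta).apply(lambda l: l.days //365)
scientists['age_years_dt']=scientists['age_days_dt'].astype('timedelta64[D]').astype(int) // 365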
In [102]: scientists.dtypes
Out[102]:
Name object
Born object
Died object
Age int64
Occupation object
Born_dt datetime64[ns]
Died_dt datetime64[ns]
age_days_dt timedelta64[ns]
age_years_dt int64
dtype: object
In [103]: scientists['age_years_dt']
Out[103]:
0 37
1 61
2 90
3 66
4 56
5 45
6 41
7 77
Name: age_years_dt, dtype: int64
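The timedelta-to-years conversion can also be written with the .dt.days accessor, which reads the whole-day count directly from a timedelta64 Series. This is a minimal sketch using the first three rows of the CSV; the variable names are illustrative. It is offered because casts such as astype('timedelta64[D]') may not be accepted by newer pandas releases. Dividing by 365 ignores leap days, so the result is an approximation that happens to match the Age column for this data.

```python
import pandas as pd

# Birth and death dates of the first three scientists in the CSV.
born = pd.to_datetime(pd.Series(['1920-07-25', '1876-06-13', '1820-05-12']),
                      format='%Y-%m-%d')
died = pd.to_datetime(pd.Series(['1958-04-16', '1937-10-16', '1910-08-13']),
                      format='%Y-%m-%d')

# Subtracting two datetime64 Series yields a timedelta64 Series.
age_days = died - born

# .dt.days is the integer number of whole days; floor-divide to get years.
age_years = age_days.dt.days // 365

print(age_years.tolist())  # [37, 61, 90]
```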
Comparing it with ages shows that the two are not equal, because ages was shuffled in place earlier.
In [106]: scientists['age_years_dt'].equals(ages)
Out[106]: False
In [107]: type(ages)
Out[107]: pandas.core.series.Series
In [108]: ages
Out[108]:
0 66
1 56
2 41
3 77
4 90
5 45
6 37
7 61
Name: Age, dtype: int64
In [109]: scientists['age_years_dt']
Out[109]:
0 37
1 61
2 90
3 66
4 56
5 45
6 41
7 77
Name: age_years_dt, dtype: int64
In [110]: scientists['age_years_dt']==ages
Out[110]:
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
dtype: bool
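The comparison above is positional: equals() and == both check row by row, so two Series with the same values in a different order compare as unequal. The sketch below reproduces this with the two columns of values printed above, hard-coded for self-containment; the variable names are illustrative.

```python
import pandas as pd

# Values copied from the session output above.
computed = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name='age_years_dt')
shuffled = pd.Series([66, 56, 41, 77, 90, 45, 37, 61], name='Age')

# Row-by-row comparison: the order differs, so the Series are not equal.
same_order = computed.equals(shuffled)

# Element-wise comparison: only the row holding 45 matches.
matches = computed == shuffled

# Sorting both sides shows that the underlying values are identical.
same_values = computed.sort_values().tolist() == shuffled.sort_values().tolist()

print(same_order, int(matches.sum()), same_values)  # False 1 True
```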