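3. Relationship between fraud and the distribution of transactions over time
- # Descriptive statistics of Time for each class, to compare how the two are distributed over time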
- print('Normal')
- print(crecreditcard_data.
- Time[crecreditcard_data.Class == 0].describe())
- print('-'*25)
- print('Fraud')
- print(crecreditcard_data.
- Time[crecreditcard_data.Class == 1].describe())
- Normal
- count 284315.000000
- mean 94838.202258
- std 47484.015786
- min 0.000000
- 25% 54230.000000
- 50% 84711.000000
- 75% 139333.000000
- max 172792.000000
- Name: Time, dtype: float64
- -------------------------
- Fraud
- count 492.000000
- mean 80746.806911
- std 47835.365138
- min 406.000000
- 25% 41241.500000
- 50% 75568.500000
- 75% 128483.000000
- max 170348.000000
- Name: Time, dtype: float64
- f,(ax1,ax2)=plt.subplots(2,1,sharex=True,figsize=(12,6))
- bins=50
- ax1.hist(crecreditcard_data.Time[crecreditcard_data.Class == 1],bins=bins)
- ax1.set_title('Fraud',fontsize=22)
- ax1.set_ylabel('Number of transactions',fontsize=15)
- ax2.hist(crecreditcard_data.Time[crecreditcard_data.Class == 0],bins=bins)
- ax2.set_title('Normal',fontsize=22)
- plt.xlabel('Time (seconds)',fontsize=15)
- plt.xticks(fontsize=15)
- plt.ylabel('Number of transactions',fontsize=15)
- # plt.yticks(fontsize=22)
- plt.show()
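Fraudulent transactions show no clear relationship with time and no periodicity;
normal transactions are clearly periodic, with a roughly bimodal shape over the 172,792 seconds (about two days) that the data covers.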
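One way to check the periodicity claim is to fold Time onto a 24-hour cycle and count transactions per hour. The sketch below is a minimal illustration, not part of the original analysis. It assumes Time is the number of seconds elapsed since the first transaction (so "hour 0" is relative to that first transaction, not necessarily midnight). The helper name `hourly_counts` and the example values are hypothetical.

```python
from collections import Counter

SECONDS_PER_DAY = 24 * 60 * 60

def hourly_counts(times):
    """Count transactions per hour of the day (0-23).

    `times` is any iterable of seconds elapsed since the first transaction
    (an assumption about the Time column).
    """
    hours = Counter(int(t % SECONDS_PER_DAY // 3600) for t in times)
    return [hours.get(h, 0) for h in range(24)]

# Hypothetical example values, not rows from the real data set
example_times = [0.0, 406.0, 3600.0, 86400.0, 90000.0, 172792.0]
counts = hourly_counts(example_times)
```

On the real data, the same helper could be called once per class, for example `hourly_counts(crecreditcard_data.Time[crecreditcard_data.Class == 0])`. A class with a daily rhythm should show the same busy and quiet hours on both days.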
4. Relationship between fraud and transaction amount, and the amount distribution
- print('Fraud')
- print(crecreditcard_data.Amount[crecreditcard_data.Class ==1].describe())
- print('-'*25)
- print('Normal transactions')
- print(crecreditcard_data.Amount[crecreditcard_data.Class==0].describe())
- Fraud
- count 492.000000
- mean 122.211321
- std 256.683288
- min 0.000000
- 25% 1.000000
- 50% 9.250000
- 75% 105.890000
- max 2125.870000
- Name: Amount, dtype: float64
- -------------------------
- Normal transactions
- count 284315.000000
- mean 88.291022
- std 250.105092
- min 0.000000
- 25% 5.650000
- 50% 22.000000
- 75% 77.050000
- max 25691.160000
- Name: Amount, dtype: float64
- f,(ax1,ax2)=plt.subplots(2,1,sharex=True,figsize=(12,6))
- bins=30
- ax1.hist(crecreditcard_data.Amount[crecreditcard_data.Class == 1],bins=bins)
- ax1.set_title('Fraud',fontsize=22)
- ax1.set_ylabel('Number of transactions',fontsize=15)
- ax2.hist(crecreditcard_data.Amount[crecreditcard_data.Class == 0],bins=bins)
- ax2.set_title('Normal',fontsize=22)
- plt.xlabel('Amount ($)',fontsize=15)
- plt.xticks(fontsize=15)
- plt.ylabel('Number of transactions',fontsize=15)
- plt.yscale('log')
- plt.show()
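Amounts are generally low in both classes (the median is 9.25 for fraud and 22.00 for normal transactions), so the Amount column on its own offers little help in telling the two apart.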
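Because a few very large amounts stretch the x-axis, a linear histogram squeezes almost every transaction into the first bin. The sketch below shows two simple ways to look at such skewed values: the share of amounts at or below a threshold, and a `log1p` transform applied before plotting. Both helper names are hypothetical. The five sample values are borrowed from the fraud summary statistics above purely for illustration; they are not real rows.

```python
import math

def share_at_or_below(amounts, threshold):
    """Fraction of amounts that are less than or equal to `threshold`."""
    amounts = list(amounts)
    return sum(1 for a in amounts if a <= threshold) / len(amounts)

def log_amounts(amounts):
    """Apply log(1 + x) so that zero amounts stay at zero and large values are compressed."""
    return [math.log1p(a) for a in amounts]

# Illustrative values only (min, quartiles and max of the fraud class)
sample = [0.0, 1.0, 9.25, 105.89, 2125.87]
share = share_at_or_below(sample, 100)   # 3 of the 5 values are <= 100
logged = log_amounts(sample)
```

Either helper accepts a pandas Series directly, for example `log_amounts(crecreditcard_data.Amount[crecreditcard_data.Class == 1])`, and the transformed values can be passed to `ax1.hist` in place of the raw amounts.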
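5. Relationship between each independent variable (V1-V28) and the dependent variable
To see whether each variable is related to the normal/fraud label, we plot the two class distributions for every variable with distplot and inspect them one by one: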
- import warnings
- import matplotlib.gridspec as gridspec
- warnings.filterwarnings('ignore')
- # Anonymized feature columns: everything except Time, Amount and Class
- v_features=[x for x in crecreditcard_data.columns
-             if x not in ['Time','Amount','Class']]
- plt.figure(figsize=(12,28*4))
- gs=gridspec.GridSpec(28,1)
- # Note: distplot is deprecated in recent seaborn releases; histplot(..., kde=True) is the suggested replacement
- for i,cn in enumerate(v_features):
-     ax=plt.subplot(gs[i])
-     # Red: fraud (Class == 1); green: normal (Class == 0)
-     sns.distplot(crecreditcard_data[cn][crecreditcard_data.Class==1],bins=50,color='red')
-     sns.distplot(crecreditcard_data[cn][crecreditcard_data.Class==0],bins=50,color='green')
-     ax.set_xlabel('')
-     ax.set_title('Histogram: '+str(cn))
- plt.savefig('各个变量与class的相关.png',transparent=False,bbox_inches='tight')
- plt.show()
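Red denotes fraud; green denotes normal.
- The larger the overlap between the two distributions, the less the variable separates fraud from normal transactions, e.g. V15;
- The smaller the overlap, the more the variable helps distinguish the two classes, e.g. V14.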
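The visual "overlap area" can also be approximated numerically. The sketch below is one possible way to do it, not the method used in the original analysis: both samples are binned on shared edges, the counts are converted to proportions, and the smaller proportion in each bin is summed. The result is 1.0 for identical histograms and close to 0 for well-separated ones. The function name `overlap_coefficient` and the synthetic Gaussian samples are assumptions made for illustration.

```python
import random

def overlap_coefficient(a, b, bins=50):
    """Histogram-based overlap of two samples, between 0 and 1."""
    a, b = list(a), list(b)
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    if hi == lo:
        return 1.0  # every value is identical, so the histograms coincide
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so that the maximum value falls into the last bin
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        return [c / len(values) for c in counts]

    pa, pb = proportions(a), proportions(b)
    return sum(min(x, y) for x, y in zip(pa, pb))

# Synthetic stand-ins for a weakly and a strongly separating feature
rng = random.Random(0)
normal = [rng.gauss(0, 1) for _ in range(5000)]
similar = [rng.gauss(0.1, 1) for _ in range(500)]   # nearly the same distribution
shifted = [rng.gauss(-6, 2) for _ in range(500)]    # clearly shifted distribution
overlap_similar = overlap_coefficient(normal, similar)
overlap_shifted = overlap_coefficient(normal, shifted)
```

Applied to the real data, the function could be called per column of `v_features`, passing the `Class == 1` and `Class == 0` values of each feature, and the features sorted by the result. Features with a low coefficient would be the candidates that separate the classes best.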