Competition background: As a professional investor, judging whether a listed company's financial data is sound requires weighing many related factors. Faced with years of financial reports from listed companies, an investor can use data mining to screen indicators for tracking and analysis, distinguish genuine figures from fabricated ones, and avoid stepping on investment landmines.

Requirements:

(1) Based on the financial data provided by listed companies in each industry, identify the indicators associated with financial-data fraud in each industry, and compare the similarities and differences of these indicators across industries.

(2) Based on the financial data of the manufacturing-sector listed companies, identify the companies whose year-6 financial data is fraudulent.

(3) Based on the financial data of the non-manufacturing listed companies, identify the companies whose year-6 financial data is fraudulent.

Significance: In the era of big data, monitoring and prediction with AI and machine-learning techniques can improve the accuracy of corporate financial reports and help detect fraud or concealment, making them an important component of corporate credit risk control.

import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

import seaborn as sns
color = sns.color_palette()

from scipy import stats
from scipy.stats import norm, skew

t1 = pd.read_csv("制造业.csv")       # manufacturing-industry financial data
t1_train = t1.drop("FLAG", axis=1)   # features only; FLAG is the fraud label
t1
(output: the raw manufacturing table t1, 13970 rows × 363 columns, from TICKER_SYMBOL through AR_TURNOVER plus FLAG; most turnover ratios are NaN, and FLAG is NaN for the year-6 rows)

Data preprocessing

Compute each feature's missing rate and sort in descending order.

all_data_na = (t1_train.isnull().sum() / len(t1_train) * 100).sort_values(ascending=False) 

missing_data = pd.DataFrame({'missing_data' : all_data_na})
missing_data
                   missing_data
ACCRUED_EXP           99.971367
N_INC_BORR_OTH_FI     99.806729
PERPETUAL_BOND_L      99.634932
PREFERRED_STOCK_L     99.606299
PREFERRED_STOCK_E     99.591983
...                         ...
T_COMPR_INCOME         0.000000
N_INCOME_ATTR_P        0.000000
FINAN_EXP              0.000000
ACT_PUBTIME            0.000000
TICKER_SYMBOL          0.000000

362 rows × 1 columns

Visualize the missing rates in a chart.

f, ax = plt.subplots(figsize=(30, 15))
plt.xticks(rotation='90')
sns.barplot(x=all_data_na.index, y=all_data_na)  # bar chart of per-feature missing rates
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
Text(0.5, 1.0, 'Percent missing data by feature')
# features with a missing rate above 80%
missing_data_count1 = all_data_na.index[all_data_na > 80]

# features with a missing rate below 20%
missing_data_count2 = all_data_na.index[all_data_na < 20]

print(missing_data_count1.shape, missing_data_count2.shape)
(93,) (84,)
# the 93 features whose missing rate exceeds 80%
a = missing_data.values[:93]
x = pd.DataFrame(a, index=missing_data.index[:93])
x
                           0
ACCRUED_EXP        99.971367
N_INC_BORR_OTH_FI  99.806729
PERPETUAL_BOND_L   99.634932
PREFERRED_STOCK_L  99.606299
PREFERRED_STOCK_E  99.591983
...                      ...
OP_CL              81.338583
R_D                81.159628
N_CF_OPA_LIAB      80.952040
N_CF_NFA_LIAB      80.952040
OP_TL              80.916249

93 rows × 1 columns

Drop the features whose missing rate exceeds 80%.

t2=t1_train.drop(columns=x.index)
t2
(output: t2, 13970 rows × 269 columns; the 93 high-missing columns are gone)
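As an aside, the same drop can be done in a single pandas call (a minimal sketch, assuming t1_train from above; thresh is the minimum count of non-null values a column must have to be kept, and t2_alt is a hypothetical name):

# equivalent one-liner: keep columns that are at least 20% non-null,
# i.e. drop columns whose missing rate exceeds 80%
keep_at_least = int(0.2 * len(t1_train))  # minimum non-null count per column
t2_alt = t1_train.dropna(axis=1, thresh=keep_at_least)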

Fill the features with missing rates between 20% and 80% with the column median.

b = missing_data.index[93:278]   # the 185 features with 20%-80% missing
for o in b:
    t2[o] = t2[o].fillna(t2[o].median())
t2
(output: t2 after median filling, 13970 rows × 269 columns; the previously missing turnover ratios now hold their column medians)

Impute the features with missing rates below 20% using KNN.

d = missing_data.index[278:336]  # names of the columns to impute

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=10)
t2[d] = imputer.fit_transform(t2[d])
print(t2.isnull().sum())
TICKER_SYMBOL    0
ACT_PUBTIME 0
PUBLISH_DATE 0
END_DATE_REP 0
END_DATE 0
..
TFA_TURNOVER 0
DAYS_AP 0
DAYS_INVEN 0
TA_TURNOVER 0
AR_TURNOVER 0
Length: 269, dtype: int64

Drop the features irrelevant to predicting fraud

Drop the 10 features that carry no signal for fraud prediction: ticker symbol, actual disclosure time, publish date, reporting cutoff date, end date, report type, fiscal period, merged flag (1 = consolidated, 2 = parent company), accounting standards, and currency code.

t2=t2.drop(["TICKER_SYMBOL","ACT_PUBTIME","PUBLISH_DATE","END_DATE_REP","END_DATE","REPORT_TYPE","FISCAL_PERIOD","MERGED_FLAG","ACCOUTING_STANDARDS","CURRENCY_CD"],axis=1)

Check whether any missing values remain.

t2.isna().any().sum()  # number of columns that still contain NaN
0

Standardize the data

from sklearn.preprocessing import StandardScaler

# standardize; returns the standardized data
t4 = pd.DataFrame(StandardScaler().fit_transform(t2), columns=t2.columns)
t4
(output: t4, the standardized table, 13970 rows × 259 columns)
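One caveat worth noting: the scaler above is fit on all six years at once, so year-6 statistics leak into the standardized training features. A leakage-free variant (a sketch, reusing the 11310-row boundary applied in the split below) fits on the first five years only and applies the same transform to year 6:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(t2.iloc[:11310])  # fit on years 1-5 only
t4_train = pd.DataFrame(scaler.transform(t2.iloc[:11310]), columns=t2.columns)
t4_test = pd.DataFrame(scaler.transform(t2.iloc[11310:]), columns=t2.columns)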

Splitting the data set

The first 5 years of data form the training/validation set train; year 6 is the test set test.

# first 5 years -> training/validation set; year 6 -> test set
train=t4.iloc[:11310,:]
test=t4.iloc[11310:,:259]
train["FLAG"]=t1["FLAG"]
train
(output: train, 11310 rows × 260 columns; 259 standardized features plus FLAG)

import pandas as pd
train.to_excel("训练集、验证集.xlsx")
test.to_excel("测试集.xlsx")

Handling class imbalance

X_train1=np.array(train.iloc[:11310,:259])
y_train1 =train.FLAG.values
from collections import Counter

# class distribution: 11219 non-fraud vs 91 fraud samples (roughly 123:1), severely imbalanced
print(Counter(y_train1))
Counter({0.0: 11219, 1.0: 91})
import matplotlib.pyplot as plt

# make data
x = [11219, 91]
labels = ['0', '1']

# plot
fig, ax = plt.subplots()
ax.pie(x, radius=3, center=(4, 4), labels=labels,
       wedgeprops={"linewidth": 1, "edgecolor": "white"}, autopct='%.1f%%', frame=True)

ax.set(xlim=(0, 8), xticks=np.arange(1, 8),
       ylim=(0, 8), yticks=np.arange(1, 8))

plt.show()
from imblearn.over_sampling import SMOTE

# oversample the minority class to 20% of the majority: 0.2 * 11219 ≈ 2243, i.e. about a 5:1 ratio
oversample = SMOTE(sampling_strategy=0.2, random_state=42)
X_os, y_os = oversample.fit_resample(X_train1, y_train1)
print(Counter(y_os))
Counter({0.0: 11219, 1.0: 2243})
X_os.shape
(13462, 259)
import pandas as pd
a1 = pd.DataFrame(X_os)
a1["259"] = y_os              # append the label as the 260th column
a1.columns = train.columns    # restore the original column names (incl. FLAG)
a1
(output: a1, the oversampled table, 13462 rows × 260 columns; the synthetic rows appended at the bottom have FLAG = 1.0)

a2 = a1.drop("FLAG", axis=1)  # features only
a2
(output: a2, 13462 rows × 259 columns)

Splitting into training and validation sets

# split the oversampled 5-year manufacturing data into training and validation sets
from sklearn.model_selection import train_test_split
import pandas as pd
train_data, test_data1 = train_test_split(a1, test_size=0.2, random_state=0)
# validation set
test_data1
(output: test_data1, the validation set, 2693 rows × 260 columns)

# training set
train_data
(output: train_data, the training set, 10769 rows × 260 columns)

# drop FLAG from the validation set
test_data2=test_data1.drop("FLAG",axis=1)

Building the fraud-indicator models

from sklearn.metrics import roc_curve, auc
from sklearn import metrics
# feature-importance plotting
from xgboost import plot_importance
# training data
X_train = np.array(train_data.iloc[:, :259])
y_train = np.array(train_data["FLAG"])
# validation labels
y = np.array(test_data1["FLAG"])
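Every tuning step below repeats the same fit / predict_proba / AUC pattern, so a small helper (hypothetical, not in the original notebook) condenses it; the published blocks are kept as-is:

def eval_auc(clf):
    # fit on the training split and return the validation ROC AUC
    clf.fit(X_train, y_train)
    y_score = clf.predict_proba(test_data2)[:, 1]  # probability of class 1 (fraud)
    fpr, tpr, _ = metrics.roc_curve(y, y_score, pos_label=1)
    return metrics.auc(fpr, tpr)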
feature_1 = a1.drop('FLAG', axis=1)  # same feature table as a2, without the label
feature_1
(output: feature_1, 13462 rows × 259 columns, the full standardized feature table from CASH_C_EQUIV through AR_TURNOVER)

Logistic Regression tuning process

For each model, parameters start at their default values and are then tuned one at a time, combining grid search with the professional literature to find the values that give the highest accuracy without overfitting. The optimization metric throughout is the AUC.

Logistic regression has 2 parameters to tune: penalty and C. penalty is the regularization method; C is the hyperparameter giving the inverse of the regularization strength (default 1, i.e. the penalty term and the loss are weighted 1:1). The smaller C is, the less the loss term counts relative to the penalty, so large coefficients are punished more heavily and the regularization effect is stronger.
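The grid search described above can be reproduced with scikit-learn's GridSearchCV (a sketch over the C range stated below; scoring="roc_auc" matches the chosen metric, while the 5-fold CV setup is an assumption):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.05, 0.1, 0.2, 0.3], "penalty": ["l2"]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring="roc_auc", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)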

from sklearn.linear_model import LogisticRegression
# train the logistic regression model; C tuning range: [0.05, 0.1, 0.2, 0.3]
clf1 = LogisticRegression(C=0.2,penalty="l2").fit(X_train, y_train)

# predicted probability of the fraud class (1)
y_pred_gbc = clf1.predict_proba(test_data2)[:,1]

# compute the validation ROC AUC
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc=metrics.auc(fpr, tpr)
roc_auc
0.884673004897724
# penalty tuning range: ["l1", "l2", "none"]
clf2= LogisticRegression(penalty="none").fit(X_train, y_train)
y_pred_gbc = clf2.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc=metrics.auc(fpr, tpr)
roc_auc
0.9013195044655719

Logistic regression is a linear discriminative model, and the data set at hand is high-dimensional, so other, nonlinear models may well perform better.

SVM tuning process

SVM has 2 parameters to tune: kernel and C.

kernel selects the kernel function; the options are "poly" (polynomial kernel), "rbf" (Gaussian kernel / radial basis function), "linear" (linear kernel), and "sigmoid" (sigmoid kernel). The kernel plays a central role in the SVM, simplifying the inner-product computation between vectors; the Gaussian kernel is widely used on nonlinear classification problems.

C is the penalty coefficient on misclassified samples, used mostly in soft-margin classification. The larger C is, the more heavily misclassified samples are punished and the higher the training accuracy, but overfitting becomes likely and the model's generalization drops. Conversely, a smaller C tolerates some misclassified training samples and strengthens generalization.
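The four kernels can be compared in a single loop (a sketch; it scores with decision_function, which roc_curve accepts directly and which avoids the extra internal cross-validation that probability=True triggers):

from sklearn import svm

for k in ["linear", "rbf", "sigmoid", "poly"]:
    model = svm.SVC(kernel=k).fit(X_train, y_train)
    scores = model.decision_function(test_data2)  # signed margins work for ROC
    fpr, tpr, _ = metrics.roc_curve(y, scores, pos_label=1)
    print(k, metrics.auc(fpr, tpr))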

from sklearn import svm
# kernel tuning range: ["linear", "rbf", "sigmoid", "poly"]
svm1 = svm.SVC(kernel='rbf',probability=True).fit(X_train, y_train)
y_pred_gbc = svm1.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc=metrics.auc(fpr, tpr)
roc_auc
0.8988533563814463
# C tuning: small values around 0.003 (kernel left at the default rbf)
svm2 = svm.SVC(C=0.003,probability=True).fit(X_train, y_train)
y_pred_gbc = svm2.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc=metrics.auc(fpr, tpr)
roc_auc
0.8741169691731491

RF tuning process

The random forest has 4 parameters to tune:

max_depth: the maximum depth of the trees,
n_estimators: the number of trees,
min_samples_split: the minimum number of samples required to split an internal node,
min_samples_leaf: the minimum number of samples required at a leaf node.

To guard against overfitting, max_depth is held at 3 while the other parameters are tuned.
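The four ranges can also be explored jointly rather than one at a time; below is a sketch with RandomizedSearchCV (the ranges are taken from the comments in this section, while the iteration count and fold count are assumptions):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "max_depth": [3, 5, 7, 8, 11, 13],
    "n_estimators": [300, 400, 500, 600, 700],
    "min_samples_leaf": [10, 20, 40, 60, 70, 80, 100],
    "min_samples_split": [60, 70, 80, 90, 110, 130],
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_dist,
                            n_iter=20, scoring="roc_auc", cv=3, random_state=0)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)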

from sklearn.ensemble import RandomForestClassifier
# max_depth tuning range: [3, 5, 7, 8, 11, 13]
RF1 = RandomForestClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
y_pred_gbc = RF1.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.8762738884087198
# n_estimators tuning range: [300, 400, 500, 600, 700]
RF2 = RandomForestClassifier(n_estimators=600, random_state=0,max_depth=3).fit(X_train, y_train)
y_pred_gbc = RF2.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.8856746374723902
# min_samples_leaf tuning range: [10, 20, 40, 60, 70, 80, 100]
RF3= RandomForestClassifier(min_samples_leaf=30, random_state=0,max_depth=3).fit(X_train, y_train)
y_pred_gbc = RF3.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.8766320944972631
# min_samples_split tuning range: [60, 70, 80, 90, 110, 130]
RF4 = RandomForestClassifier(min_samples_split=80, random_state=0,max_depth=3).fit(X_train, y_train)
y_pred_gbc = RF4.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.8757802746566791

DT tuning process

The decision tree has 4 parameters to tune:

max_depth: the maximum depth of the tree,
min_samples_split: the minimum number of samples required to split an internal node,
min_samples_leaf: the minimum number of samples required at a leaf node,
max_leaf_nodes: the maximum number of leaf nodes.

To guard against overfitting, max_depth is held at 6 while measuring each remaining parameter's effect on the AUC.

from sklearn import tree
# max_depth tuning range: [5, 6, 7, 8, 9, 10]
DT1 =tree.DecisionTreeClassifier(max_depth=6).fit(X_train,y_train)
y_pred_gbc = DT1.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.9026846249879958
# min_samples_leaf tuning range: [2, 3, 6, 8]
DT1 =tree.DecisionTreeClassifier(min_samples_leaf=3,max_depth=6).fit(X_train,y_train)
y_pred_gbc = DT1.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.9008700662633248
# max_leaf_nodes tuning range: [50, 60, 70, 80, 100]
DT2 =tree.DecisionTreeClassifier(max_leaf_nodes=70, max_depth=6).fit(X_train,y_train)
y_pred_gbc = DT2.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.9006472678382791
# to guard against overfitting, max_depth is held at 6 while tuning
# min_samples_split tuning range: [2, 3, 4, 5, 6, 8]
DT3 =tree.DecisionTreeClassifier(min_samples_split=3,max_depth=6).fit(X_train,y_train)
y_pred_gbc = DT3.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.9005358686257564
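To sanity-check what the tuned tree actually splits on, sklearn's export_text can print its decision rules (a sketch applied to the DT3 model above, truncated to depth 2 for brevity):

from sklearn.tree import export_text

print(export_text(DT3, feature_names=list(a2.columns), max_depth=2))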

Top 20 features by importance in the DT model

DT_importances = DT3.feature_importances_*10000  # scale up for readability
DT = pd.Series(DT_importances, index = a2.columns)
DT = DT.sort_values(ascending=False)
DT = pd.DataFrame({'feature_importances' : DT})
DT.head(20)
                   feature_importances
DILUTED_EPS                1876.459630
ESTIMATED_LIAB             1418.956658
RETAINED_EARNINGS          1274.737100
ASSETS_DISP_GAIN           1189.341616
C_FR_CAP_CONTR              350.146297
CASH_C_EQUIV                305.869870
DEFER_TAX_LIAB              305.319042
INT_RECEIV                  278.811896
CURRENT_RATIO               254.761597
N_CF_FR_FINAN_A             245.185566
OTH_GAIN                    214.434895
GOODWILL                    207.893979
N_INCOME                    199.309895
CL_TA                       190.726453
NOPERATE_EXP                180.024039
OTH_CL                      167.889986
GAIN_INVEST                 160.657452
DIV_PAYABLE                 157.439514
IT_TR                       117.051125
A_J_INVEST_INCOME           101.434231

XGBoost tuning process

XGBoost has 9 parameters to tune; only the two most important for this model are introduced below.

The first is n_estimators, which plays a major role in XGBoost: it is the number of base classifiers, and the larger it is, the stronger the model's learning capacity.

The second is learning_rate, the learning rate of the ensemble, also called the step size that controls the pace of iteration; tuning it effectively helps prevent overfitting. The default is 0.1 and the tuning range is [0, 1].

To prevent overfitting as far as possible, the learning rate is fixed at 0.001 while the other parameters are tuned.

from xgboost import XGBClassifier
from xgboost import plot_importance
# learning_rate tuning range: [0.001, 0.002, 0.003, 0.0035]
XGBoost1 = XGBClassifier(learning_rate=0.001).fit(X_train,y_train)
y_pred_gbc = XGBoost1.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.9089825218476904
# to guard against overfitting, learning_rate is held at 0.001 while tuning
# n_estimators tuning range: [100, 110, 120, 200, 300]
XGBoost2 = XGBClassifier(n_estimators=120,learning_rate=0.001).fit(X_train,y_train)
y_pred_gbc = XGBoost2.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.911076058772688
# to guard against overfitting, learning_rate is held at 0.001 while tuning
# max_depth tuning range: [2, 3, 5, 6, 7, 10]
XGBoost3 = XGBClassifier(max_depth=6,learning_rate=0.001).fit(X_train,y_train)
y_pred_gbc = XGBoost3.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.9089825218476904
# to guard against overfitting, learning_rate is held at 0.001 while tuning
# min_child_weight tuning range: [1, 3, 4, 5, 7, 8]
XGBoost4 = XGBClassifier(min_child_weight=3,learning_rate=0.001).fit(X_train,y_train)
y_pred_gbc = XGBoost4.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.9118438490348603
# to guard against overfitting, learning_rate is held at 0.001 while tuning
# gamma tuning range: [0.2, 0.3, 0.5, 0.6, 0.7, 0.8]
XGBoost5 = XGBClassifier(gamma=0.4,learning_rate=0.001).fit(X_train,y_train)
y_pred_gbc = XGBoost5.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.9089844425237683
# to guard against overfitting, learning_rate is held at 0.001 while tuning
# colsample_bytree tuning range: [0.6, 0.7, 0.8, 0.85, 0.9]
XGBoost7 = XGBClassifier(colsample_bytree=0.85,learning_rate=0.001).fit(X_train,y_train)
y_pred_gbc = XGBoost7.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.9089825218476904
# to guard against overfitting, learning_rate is held at 0.001 while tuning
# reg_alpha tuning range: [0.1, 0.2, 0.25, 0.3, 0.35]
XGBoost8 = XGBClassifier(reg_alpha=0.2,learning_rate=0.001).fit(X_train,y_train)
y_pred_gbc = XGBoost8.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.9087702871410737
# to guard against overfitting, learning_rate is held at 0.001 while tuning
# reg_lambda tuning range: [0.15, 0.3, 0.5, 0.8]
XGBoost9 = XGBClassifier(reg_lambda=0.3,learning_rate=0.001).fit(X_train,y_train)
y_pred_gbc = XGBoost9.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.9160871026601364
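Since each parameter above was varied in isolation, a natural follow-up (a sketch, not a run from the original notebook) is to combine the per-parameter winners and re-check the validation AUC:

# combining the individually best values found above; whether this beats
# reg_lambda=0.3 alone (AUC 0.9161) would need to be verified on the validation set
XGBoost_all = XGBClassifier(n_estimators=120, min_child_weight=3, reg_lambda=0.3,
                            learning_rate=0.001).fit(X_train, y_train)
y_pred_gbc = XGBoost_all.predict_proba(test_data2)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y, y_pred_gbc, pos_label=1)
print(metrics.auc(fpr, tpr))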

Top 20 features by importance in the XGBoost model

XGBoost_importances = XGBoost9.feature_importances_*10000
XGBoost = pd.Series(XGBoost_importances, index = a2.columns)
XGBoost = XGBoost.sort_values(ascending=False)
XGBoost = pd.DataFrame({'feature_importances' : XGBoost})
XGBoost.head(20)
                   feature_importances
DILUTED_EPS                1092.639893
ASSETS_DISP_GAIN            839.089539
RETAINED_EARNINGS           501.631378
T_CA                        428.929138
ESTIMATED_LIAB              363.835968
DEFER_TAX_LIAB              311.677368
CURRENT_RATIO               294.901703
N_CF_FR_FINAN_A             284.164001
GOODWILL                    239.289185
N_INCOME                    238.080658
CL_TA                       224.853226
INVENTORIES                 224.492996
ROE_A                       214.091843
NOPERATE_EXP                196.429474
OTH_CA                      192.453537
OTH_CL                      191.517349
C_FR_MINO_S_SUBS            191.478973
GAIN_INVEST                 187.904297
C_INF_FR_INVEST_A           180.430847
CASH_C_EQUIV                178.665329

GBM tuning process

GBM has 7 tunable parameters; the ones relatively important to this model were selected for tuning.

The first is max_depth: the maximum depth of each tree.

The second is n_estimators: the number of base classifiers; it has a strong effect and can substantially raise the model's learning capacity.

The third is learning_rate: effective tuning of the learning rate plays a major role in whether the model overfits; its range is [0, 1] with a default of 0.1. To improve generalization and prevent overfitting, grid search combined with the machine-learning literature led to setting learning_rate to 0.0088.

from sklearn.ensemble import GradientBoostingClassifier
# learning_rate tuning range: [0.004, 0.007, 0.0076, 0.0088, 0.009]
GBM1 = GradientBoostingClassifier(learning_rate=0.0088).fit(X_train,y_train)
y_pred_gbc = GBM1.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.9016575434552964
# n_estimators tuning range: [110, 120, 130, 140, 160]
GBM2 = GradientBoostingClassifier(n_estimators=130,learning_rate=0.0088).fit(X_train,y_train)
y_pred_gbc = GBM2.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.9171185057140113
# subsample tuning range: [0.1, 0.2, 0.25, 0.3, 0.4]
GBM3 = GradientBoostingClassifier(subsample=0.3,learning_rate=0.0088).fit(X_train,y_train)
y_pred_gbc = GBM3.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.9131902429655239
# min_samples_split tuning range: [2, 3, 4, 5, 6]
GBM4 = GradientBoostingClassifier(min_samples_split=4,learning_rate=0.0088).fit(X_train,y_train)
y_pred_gbc = GBM4.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.9016575434552964
# min_samples_leaf tuning range: [2, 3, 4, 6, 7, 9]
GBM5 = GradientBoostingClassifier(min_samples_leaf=3,learning_rate=0.0088).fit(X_train,y_train)
y_pred_gbc = GBM5.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.9021089023336215
# max_depth tuning range: [2, 3, 4, 5, 8]
GBM6 = GradientBoostingClassifier(max_depth=3,learning_rate=0.0088).fit(X_train,y_train)
y_pred_gbc = GBM6.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.9016575434552964
# validation_fraction tuning range: [0.1, 0.3, 0.4, 0.5, 0.7, 0.8]
GBM7 =GradientBoostingClassifier(validation_fraction=0.1,learning_rate=0.0088).fit(X_train,y_train)
y_pred_gbc = GBM7.predict_proba(test_data2)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y,y_pred_gbc,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
0.9016575434552964
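Note that in scikit-learn's GradientBoostingClassifier, validation_fraction only takes effect when n_iter_no_change is also set (it is the share of training data held out for early stopping), which is consistent with the AUC above matching the learning_rate-only baseline. A sketch with early stopping actually enabled:

# hold out 10% of the training data and stop once the validation score
# has failed to improve for 10 consecutive iterations
GBM8 = GradientBoostingClassifier(learning_rate=0.0088, validation_fraction=0.1,
                                  n_iter_no_change=10).fit(X_train, y_train)
y_pred_gbc = GBM8.predict_proba(test_data2)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y, y_pred_gbc, pos_label=1)
print(metrics.auc(fpr, tpr))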

Top 20 features by importance in the GBM model

GBM_importances = GBM2.feature_importances_*10000
GBM = pd.Series(GBM_importances, index = a2.columns)
GBM = GBM.sort_values(ascending=False)
GBM = pd.DataFrame({'feature_importances' : GBM})
GBM.head(20)
                    feature_importances
DILUTED_EPS                 2545.405622
ASSETS_DISP_GAIN            1393.862699
RETAINED_EARNINGS           1201.324132
ESTIMATED_LIAB              1123.032285
NCA_DISPLOSS                 496.913323
C_FR_CAP_CONTR               429.072252
OTH_GAIN                     392.266813
NOPERATE_EXP                 216.007424
PROC_SELL_INVEST             191.362051
T_CA                         179.957164
DEFER_TAX_LIAB               172.171302
N_CF_FR_INVEST_A             157.682156
INT_PAYABLE                  157.357831
DIV_PAYABLE                  103.399260
C_PAID_OTH_FINAN_A            95.313147
INT_RECEIV                    87.237988
REV_PS                        86.175900
C_INF_FR_INVEST_A             76.066347
ADVANCE_RECEIPTS              64.396530
T_EQUITY_ATTR_P               62.686717