ner

file tree

myconfig.py | project parameter configuration
train.py | training
eval.py | evaluation
data_generator.py | generates the data from a txt file

10 directories, 75 files

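This project performs entity extraction; the test label is mobile phone numbers.

Bond default risk prediction: predicts bond default risk from credit ratings and public-opinion (sentiment) data.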

Directory tree

.
├── apri_data.txt
├── aprio算法分析频繁量级.txt
├── bond-analysis.ipynb
├── bond-risk.ipynb
├── bond_risk_qq.ipynb
├── code.txt
├── data_
├── data_helper.py
├── data_helper.sublime-project
├── data_helper.sublime-workspace
├── _df10.csv
├── _df5_2kk.csv
├── _df5_400k.csv
├── _df5.csv
├── _df6.csv
├── _df8_2.csv
├── _df8.csv
├── _df8_feed.csv
├── df.csv
├── _df_dat.csv
├── f.csv
├── file_tree.md
├── final_datas.csv
├── fp_growth.py
├── fp_growth.py.bak
├── ipython_log.py
├── k_means_ana.py
├── k_means.py
├── __pycache__
│   ├── data_helper.cpython-36.pyc
│   ├── fp_growth.cpython-36.pyc
│   ├── k_means.cpython-36.pyc
│   └── train.cpython-36.pyc
├── README.md
├── reallytest.md
├── risk_boom.txt
├── senti_vec_means_shift.png
├── shujuzhili.md
├── train_data.csv
├── train.py
├── tree.txt

18-01-02 test record

start-date	end-date	TPPP	TPPN	TNPP	TNPN
90	180	0	3	1	322
30	90	0	7	1	318
60	180	0	2	1	323
0	180	1	0	0	326
0	90	0	2	5	319
0	30	0	3	0	323
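The column names in the table above are not defined in these notes. As a minimal sketch, assuming TPPP means "truly positive, predicted positive" (TP), TPPN means "truly positive, predicted negative" (FN), TNPP is FP and TNPN is TN, the precision and recall for each window could be derived like this. The function name `precision_recall` is hypothetical.

```python
def precision_recall(tp, fn, fp, tn):
    """Return (precision, recall); None where the denominator is zero.

    tn is accepted only so that a full table row can be passed in.
    """
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# Rows copied from the table: (start-date, end-date, TPPP, TPPN, TNPP, TNPN).
# The mapping of these columns to TP/FN/FP/TN is an assumption.
rows = [
    (90, 180, 0, 3, 1, 322),
    (30, 90, 0, 7, 1, 318),
    (60, 180, 0, 2, 1, 323),
    (0, 180, 1, 0, 0, 326),
    (0, 90, 0, 2, 5, 319),
    (0, 30, 0, 3, 0, 323),
]

results = {
    (start, end): precision_recall(tp, fn, fp, tn)
    for start, end, tp, fn, fp, tn in rows
}

for window, (precision, recall) in results.items():
    print(window, "precision:", precision, "recall:", recall)
```

Under that reading, only the 0 to 180 window identifies a positive sample without any miss or false alarm.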

data bugs: 1 time format, 2 duplicated records, 3

[('cluster', 6), ('clusterAssment', 48), ('time', 56), ('Label', 81), ('V_Time', 104), ('V3', 115), ('V5', 196), ('V1', 237), ('V11', 253), ('V12', 260), ('V6', 296), ('percent', 309), ('V8', 320), ('V30', 321), ('V20', 325), ('ID', 327), ('V23', 332), ('V15', 335), ('V18', 342), ('V25', 364), ('V16', 372), ('V22', 391), ('V2', 435), ('V10', 468), ('V21', 476), ('V26', 485), ('V29', 510), ('V24', 515), ('V17', 536), ('V19', 542), ('V9', 569), ('V4', 602), ('V14', 603), ('V27', 608), ('V7', 622), ('V28', 651), ('V13', 737)]

18-01-17

logregobj {'sta': 0.0, 'end': 1.0, 'max': 0.9032258064516129, 'judge': 0.6666666666666666, 'precision': 1.0, 'recall': 0.8235294117647058}

half_pass_hess {'sta': 0.0, 'end': 1.0, 'max': 0.9032258064516129, 'judge': 0.6666666666666666, 'precision': 1.0, 'recall': 0.8235294117647058}

filter_half_zero {'sta': 0.0, 'end': 1.0, 'max': 0.9032258064516129, 'judge': 0.6666666666666666, 'precision': 1.0, 'recall': 0.8235294117647058}

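Bond default risk: precision should be as high as possible; recall is required to be 100%.
Tianchi big data competition: use the modified loss function to classify the borderline (hard-to-separate) samples.
Time-series project: in progress.
Image recognition.

Names of the three companies that were not identified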

中科云网科技集团股份有限公司,24.0,57.0,24.0,33.0,-33.0,2831.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4.0,,,,4.0,,,,,,,,,,,,,,,,,8.0,4.0,8.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.0,,,,,,,,,,,4.0,,,,,,,,,,8.0,,,,,,,,4.0,,,,12.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.0,,,,,,,,,,,,,,,,,,,,,,,4.0,,4.0,,24.0,,,,,,4.0,,,32.0,,,,,,,,8.0,,,,,,,,,4.0,4.0,,,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,珠海中富实业股份有限公司,2015-05-22,1.0,-1.0,-1.0,-1.0,珠海中富实业股份有限公司,72.0,136.0,54.0,64.0,-82.0,3218.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, 
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,36.0,18.0,,,,,,,,,,,,,,,,,,,,,,,,,,,18.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,18.0,,,,,,,,,,,,,,,,,,,,,,18.0,,,,,,18.0,,,,,,,,,,,,,,,,18.0,,,,,,,,,,,,,,,,,,,,,,18.0,,,18.0,,,,,,,,,,,,,,,,18.0,,,,,,,,,,,18.0,,,,,,,,,,,,,,36.0,,,,,,,,,,,,,,,,,,城市建设控股集团有限公司,2016-09-30,1.0,-1.0,-1.0,-1.0,城市建设控股集团有限公司,72.0,324.0,,252.0,,3490.0

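2018-01-15 to 2018-01-19

Debt default project

After adding sentiment-trend features, the F-score improved markedly. The samples that were still missed were analyzed; they are:

珠海中富实业股份
**城市建设控股集团
中科云网科技集团股份有限公司

Sentiment within 180 days and sentiment in the 120-60 day window can account for about 50% of the weight in f1_score. Checking the sentiment data shows that the volume for these companies is slightly lower than for other defaulting companies and is indistinguishable from that of normal companies, so a new data dimension is needed to separate them further.

Next steps:

Add financial data to raise recall beyond the current level.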

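Tianchi data competition: time-series forecasting of Yancheng vehicle license-plate registration volume

Convert the series to a stationary series and analyze it with an ARIMA model, determining the ACF, PACF, differencing order, and other parameters.
Set up an R environment for later time-series analysis.
Use tushare to export financial data for 600 A-share companies, including Jiangsu-themed stocks, automotive-themed stocks, and 100 randomly sampled stocks, for association analysis.
Split the data by the 5 car brands, then identify the date data and import it.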
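The stationarity step can be illustrated without any library. The sketch below is not the project's code: it uses a synthetic series (a linear trend plus a made-up weekly pattern), and the helper names `difference` and `acf` are hypothetical. It shows how differencing removes a trend and how the sample autocorrelation then exposes the period.

```python
def difference(series, lag=1):
    """Lagged difference: series[i] - series[i - lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        raise ValueError("a constant series has no defined autocorrelation")
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


# Synthetic data (an assumption, not competition data):
# linear trend plus a weekly pattern that sums to zero.
weekly = [5, 3, 0, -2, -4, -3, 1]
series = [10 + 2 * t + weekly[t % 7] for t in range(56)]

detrended = difference(series, lag=1)    # removes the linear trend
seasonal = difference(series, lag=7)     # removes the weekly pattern too
correlations = acf(detrended, max_lag=14)

print("ACF at lag 7:", round(correlations[7], 3))
```

A strong autocorrelation at lag 7 after first differencing suggests a weekly seasonal term, which is the kind of evidence used when choosing ARIMA orders.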

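Summary of ideas and approach:

After preliminary processing, the data shows the following characteristics:

Clear seasonal periodicity
Clear monthly periodicity
Clear weekly periodicity
Some yearly periodicity
Some uncertainty

Three problems need to be solved for forecasting:

Forecasting the periodic data

Separate the components as far as possible: use wavelets, Fourier analysis, Kalman filtering, moving averages, and similar methods to extract as many periodic components as possible.
Analyze each car model separately.
Use established methods such as ARIMA and RNNs to handle the periodic data.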
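Of the separation methods listed, the moving average is the simplest to show. The following is a minimal sketch on synthetic data, not the competition pipeline: a centered moving average over one full period estimates the trend, and averaging the detrended values by position within the period recovers the periodic profile. The function names and the toy series are assumptions.

```python
def moving_average(series, window):
    """Centered moving average for an odd window; edge positions are None."""
    if window % 2 == 0:
        raise ValueError("this sketch only supports odd windows")
    half = window // 2
    out = [None] * len(series)
    for i in range(half, len(series) - half):
        out[i] = sum(series[i - half:i + half + 1]) / window
    return out


def periodic_profile(series, period):
    """Average detrended value at each position within the period."""
    trend = moving_average(series, period)
    buckets = [[] for _ in range(period)]
    for i, (value, level) in enumerate(zip(series, trend)):
        if level is not None:
            buckets[i % period].append(value - level)
    return [sum(bucket) / len(bucket) for bucket in buckets]


# Synthetic series (assumption): slow trend plus a zero-sum weekly pattern.
weekly = [5, 3, 0, -2, -4, -3, 1]
series = [100 + 0.5 * t + weekly[t % 7] for t in range(70)]

profile = periodic_profile(series, period=7)
print([round(p, 2) for p in profile])
```

The recovered profile can be subtracted from the series so that the remaining trend and residual are modelled separately, for example with ARIMA.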

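Forecasting the non-periodic data

Find external data, then run association and similarity analysis between the external data, the in-dataset periodic data that can already be handled, and the in-dataset non-periodic data.

Parameter tuning and time allocation
Use xgboost to ensemble the results of the different models and boost the final score.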

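Tianchi data competition: risky transaction detection

Adjust the loss function and the evaluation function
- Modify the loss function to randomly drop half of the positive samples
- Modify the loss function so that it only reflects the cross-entropy of predictions within 0.5 ± 0.2
- After training, one prediction differed from the previous run; on verification, that result was wrong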
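The second modification can be sketched as a custom objective in the gradient and Hessian form that gradient-boosting libraries expect. This is only an illustration of the idea, not the code used in the competition: the function name `banded_logistic_objective` and its parameters are hypothetical, and zeroing the Hessian outside the band is a simplification that a real booster may need to soften.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def banded_logistic_objective(margins, labels, center=0.5, half_width=0.2):
    """Log-loss gradient and Hessian, kept only for borderline predictions.

    Samples whose predicted probability lies outside
    [center - half_width, center + half_width] contribute nothing,
    so training concentrates on the hard-to-separate region.
    """
    grads, hessians = [], []
    for margin, label in zip(margins, labels):
        p = sigmoid(margin)
        if abs(p - center) <= half_width:
            grads.append(p - label)          # d(logloss)/d(margin)
            hessians.append(p * (1.0 - p))   # second derivative
        else:
            grads.append(0.0)
            hessians.append(0.0)
    return grads, hessians


grads, hessians = banded_logistic_objective([0.0, 5.0, -5.0], [1, 1, 0])
print(grads, hessians)
```

In this toy call only the first sample (probability 0.5) falls inside the band; the two confident predictions are ignored.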

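Summary of approach: workflow for this type of problem:

PCA: if the first principal component carries a large share of the weight, the problem can tentatively be treated as an ordinary risk model whose positive samples form a 'ball'-shaped cluster, and k-means clustering works well.
Setting the number of k-means clusters to 1 to 10 times the (total samples / negative samples) ratio gives good results.
After initial validation, merge the validation set into the training set to increase the amount of data.
Add frequent itemsets.
All of the above improved the results; this has been verified.
Dropping some negative samples and modifying the loss function makes the model give up some overall expressiveness in exchange for a better fit to the features inside a specific region of the logistic output, which clearly improves accuracy. The first-pass classification results could be split into several intervals and trained and validated separately. This item is still to be verified.
Using 2 fully connected layers trained for 70,000 iterations gives 95% accuracy, but F-score and recall are low. More training iterations, a changed network structure, and added dropout may improve this. Because of limited time, the two items above have not been verified.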

Configure the VIM environment

Run data_helper.py on a schedule and save the results to the database with the columns id, compname, date, label

sql = 'INSERT INTO resultTABLE VALUES("%s",CURTIME(),"%s")' % (compname, lb)
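Building the statement with string formatting breaks on company names that contain quotes and is open to SQL injection. A safer pattern is a parameterized query. The sketch below uses the standard-library `sqlite3` module with an in-memory database purely for illustration; the project's actual database, driver, and schema are not given in these notes, so the table definition, the `save_result` helper, and the use of `CURRENT_TIMESTAMP` in place of `CURTIME()` are all assumptions.

```python
import sqlite3


def save_result(conn, compname, label):
    """Insert one prediction row; values are bound as parameters."""
    conn.execute(
        "INSERT INTO resultTABLE (compname, date, label) "
        "VALUES (?, CURRENT_TIMESTAMP, ?)",
        (compname, label),
    )
    conn.commit()


# Hypothetical schema matching the columns id, compname, date, label.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resultTABLE ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "compname TEXT, date TEXT, label TEXT)"
)

save_result(conn, "example-company", "1")
rows = conn.execute("SELECT id, compname, label FROM resultTABLE").fetchall()
print(rows)
```

With a MySQL driver the placeholder style is usually `%s` rather than `?`, but the values are still passed as a separate tuple instead of being formatted into the string.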

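Data items matching the model: 159

Risk label names matching the model: 159

Company names matching the model: 159

Companies satisfying the model's conditions: 7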

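Weekly work summary

Bond default early warning

Cleaned the intermediate-table data into a format the model can process.
The intermediate table currently contains 1,193 companies and 159 labels.
After cleaning, 7 companies were found to have more than 3 label records in each of the 180-day, 120-day, and 60-day windows; according to the model's predictions, none of them defaulted.
Trying CNN and LSTM to retrain the model (in progress).

110 project

Extract mobile numbers, bank card numbers, ID card numbers, WeChat IDs, QQ numbers, and similar information.
Used bilstm+viterbi to implement word segmentation for one specific type of text; the results were not satisfactory.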
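For readers unfamiliar with the decoding half of that approach, here is a minimal Viterbi sketch in plain Python. It is not the project's implementation: in a BiLSTM tagger the emission scores would come from the network, whereas here both the emission and the transition scores are made-up numbers, and the function name `viterbi` is simply illustrative.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        step_scores, step_back = [], []
        for j in range(n_tags):
            best_prev = max(
                range(n_tags), key=lambda i: scores[i] + transitions[i][j]
            )
            step_scores.append(
                scores[best_prev] + transitions[best_prev][j] + emission[j]
            )
            step_back.append(best_prev)
        scores = step_scores
        backpointers.append(step_back)

    # Trace the best path backwards from the best final tag.
    path = [max(range(n_tags), key=lambda i: scores[i])]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return path


# Toy scores (assumptions): two tags, switching tags is penalised.
emissions = [[1.0, 0.0], [0.0, 0.5], [1.0, 0.0]]
sticky = [[0.0, -3.0], [-3.0, 0.0]]
free = [[0.0, 0.0], [0.0, 0.0]]

print(viterbi(emissions, sticky))  # transitions override the weak middle score
print(viterbi(emissions, free))    # reduces to a per-position argmax
```

The contrast between the two calls shows why a transition matrix matters: it lets the decoder reject tag sequences that are locally plausible but globally inconsistent.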

Guiyang full endpoint http://192.168.106.70:7950/guizhou/all/predict
Guiyang modus operandi endpoint http://192.168.106.70:7950/guizhou/method/predict
Beijing 110 full endpoint http://192.168.106.70:7950/beijing/predict/110/all
Beijing 110 case category classification http://192.168.106.70:7950/beijing/predict/110/caseClassify

| Beijing | phone number extraction | /beijing/predict/110/phoneNum | 7911 |
| Beijing | ID card extraction | /beijing/predict/110/identifier | 7911 |
| Beijing | WeChat ID extraction | /beijing/predict/110/weixin | 7911 |
| Beijing | Weibo ID extraction | /beijing/predict/110/weibo | 7911 |
| Beijing | QQ number extraction | /beijing/predict/110/qq | 7911 |
| Beijing | vehicle information extraction | /beijing/predict/110/carinfo | 7911 |
| Beijing | credit card extraction | /beijing/predict/110/creditCard | 7911 |

Request

{
  "messageid": "111111111",
  "clientid": "22222",
  "encrypt": false, # may be a boolean, or the string "false" or "true"
  "text": [
    "details of case 1",
    "details of case 2",
    ...
  ]
}
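A request body of this shape could be assembled as follows. This is a hedged sketch using only the standard library: the notes do not state the HTTP method, the content type, or the response format, so the use of POST with a JSON body and the helper name `build_request` are assumptions. The request object is only constructed here, not sent.

```python
import json
import urllib.request


def build_request(url, texts, messageid, clientid, encrypt=False):
    """Build (but do not send) a JSON request for one of the endpoints."""
    payload = {
        "messageid": messageid,
        "clientid": clientid,
        "encrypt": encrypt,   # boolean, or the string "false" / "true"
        "text": list(texts),
    }
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",                                  # assumed method
    )


req = build_request(
    "http://192.168.106.70:7950/beijing/predict/110/all",
    ["details of case 1", "details of case 2"],
    messageid="111111111",
    clientid="22222",
)
decoded = json.loads(req.data.decode("utf-8"))
print(req.get_method(), req.full_url)
print(decoded)
```

Sending it would then be a matter of passing `req` to `urllib.request.urlopen`, provided the service is reachable from the caller's network.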

Testing: number extraction

/home/siyuan/ner
├── bilstm.zip
└── bilstm
    ├── README.md
    ├── rar
    │   ├── whatsnew.txt
    │   ├── unrar
    │   ├── technote.txt
    │   ├── readme.txt
    │   ├── rar.txt
    │   ├── rar_static
    │   ├── rarlinux-3.8.0.tar.gz
    │   ├── rarfiles.lst
    │   ├── rar
    │   │   ├── whatsnew.txt
    │   │   ├── unrar
    │   │   ├── technote.txt
    │   │   ├── readme.txt
    │   │   ├── rar.txt
    │   │   ├── rar_static
    │   │   ├── rarfiles.lst
    │   │   ├── rar
    │   │   ├── order.htm
    │   │   ├── Makefile
    │   │   ├── license.txt
    │   │   ├── file_id.diz
    │   │   ├── default.sfx
    │   │   └── CentOS-Base-sohu.repo
    │   ├── order.htm
    │   ├── Makefile
    │   ├── license.txt
    │   ├── file_id.diz
    │   └── default.sfx
    ├── __pycache__
    │   ├── timer.cpython-36.pyc
    │   ├── sql_helper.cpython-36.pyc
    │   ├── digital_extract_classfier_regre.cpython-36.pyc
    │   └── data_helper.cpython-36.pyc
    ├── models
    │   └── gensim_word2vec.model
    ├── __init__.py
    ├── doc
    │   └── README.md
    ├── data_helper.py
    ├── code
    │   ├── wx_extract.py
    │   ├── train.py
    │   ├── train2.py
    │   ├── timer.py
    │   ├── test_zy.py
    │   ├── sql_helper.py
    │   ├── __pycache__
    │   │   ├── sql_helper.cpython-36.pyc
    │   │   ├── myconfig.cpython-36.pyc
    │   │   ├── myconfig.cpython-35.pyc
    │   │   ├── digital_extract_classfier_regre_mul.cpython-36.pyc
    │   │   ├── data_helper.cpython-36.pyc
    │   │   ├── data_helper.cpython-35.pyc
    │   │   ├── data_generator.cpython-36.pyc
    │   │   └── config.cpython-36.pyc
    │   ├── pkltest.py
    │   ├── pickle_test.py
    │   ├── num_append.json
    │   ├── nodes.txt
    │   ├── myconfig.py
    │   ├── model.py
    │   ├── log.txt
    │   ├── __init__.py
    │   ├── gensim_word2vec.py
    │   ├── ext_dic.json
    │   ├── eval.py
    │   ├── eval_model.py
    │   ├── eval_extract.py
    │   ├── eval2.py
    │   ├── envs.json
    │   ├── digital_extract_classfier_regre.py
    │   ├── digital_extract_classfier_regre_mul.py
    │   ├── data_helper.py
    │   ├── data_generator.py
    │   ├── data
    │   ├── config.py
    │   ├── checkpoint
    │   ├── biLstm_base.py
    │   └── auto_encoder.py
    └── ckpt
        └── *

10 directories, 75 files

scmsqhn / bilstm-att



