alanshi / charset_mnbvc
This project performs fast encoding detection and conversion on large numbers of text files, to support data cleaning for the MNBVC corpus project.
Home Page: https://pypi.org/project/charset-mnbvc/
License: MIT License
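In practice, a large share of raw data, such as game text, is provided as yml or json files, and it is hard to include these files through the command-line arguments of the convert_files script without modifying its source. The command-line input -i takes a directory. Making sure that everything under that directory, recursively, is the same kind of file is relatively easy, but making sure that every file has the .txt extension is hard.
The get_file_paths function at common_utils.py:4 hard-codes the .txt extension, and the scan_dir function at api.py:113 hard-codes it as well. Could these accept the extension as a parameter?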
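The request above could be sketched roughly as follows. This is not the project's code: collect_file_paths and its extensions parameter are hypothetical names used only for illustration, and the real get_file_paths and scan_dir may be structured differently. The sketch only assumes that the file list is built by walking the input directory recursively.

```python
import os


def collect_file_paths(root, extensions=(".txt",)):
    """Recursively list files under `root` whose name ends with one of `extensions`.

    Hypothetical helper: the default keeps the current .txt-only behaviour,
    while callers may pass e.g. (".yml", ".json") instead.
    """
    # str.endswith accepts a tuple, so one call checks every extension.
    wanted = tuple(ext.lower() for ext in extensions)
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(wanted):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

With a default of (".txt",), existing callers would keep their behaviour, and a command-line option could forward a user-supplied list of extensions to this parameter.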
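For the corpora submitted to MNBVC we use the .jsonl extension. Files of this type need a uniform encoding check (a pass-through rule) or correction.
For example, JSON saved without ensure_ascii=False looks like this: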
{"\u662f\u5426\u5f85\u67e5\u6587\u4ef6": false, "\u662f\u5426\u91cd\u590d\u6587\u4ef6": false, "\u6bb5\u843d\u6570": 17944, "\u53bb\u91cd\u6bb5\u843d\u6570": 0, "\u4f4e\u8d28\u91cf\u6bb5\u843d\u6570": 0}
whereas we would like the converted jsonl to contain no Unicode escapes, like this:
{"是否待查文件": false, "是否重复文件": false, "段落数": 17944, "去重段落数": 0, "低质量段落数": 0}
Could you check whether this final conversion step should be done in the convert_files.py script?
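One possible shape for that conversion step, as a minimal sketch using only the standard json module. The function names unescape_jsonl_line and rewrite_jsonl are hypothetical and not part of charset_mnbvc. The sketch assumes each non-empty line is one valid JSON value and that the input can be read as UTF-8 (escaped output is pure ASCII, so this holds for files like the example above).

```python
import json


def unescape_jsonl_line(line):
    """Re-serialize one JSON line so non-ASCII characters are written literally."""
    # json.loads decodes \uXXXX escapes; ensure_ascii=False stops json.dumps
    # from escaping them again on output.
    return json.dumps(json.loads(line), ensure_ascii=False)


def rewrite_jsonl(src_path, dst_path):
    """Copy a .jsonl file line by line, removing Unicode escapes (hypothetical helper)."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                dst.write(unescape_jsonl_line(line) + "\n")
```

Because every line is parsed and re-serialized, whitespace and number formatting may be normalized along the way. Key order is preserved.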