alanshi / charset_mnbvc

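This project aims to quickly detect and convert the encodings of large numbers of text files, in support of the data-cleaning work for the MNBVC corpus project.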

Home Page: https://pypi.org/project/charset-mnbvc/

License: MIT License

Python 100.00%

charset_mnbvc's People

Contributors

alanshi, janson91, larryisthere, lingeoan, pomelo, prnake

charset_mnbvc's Issues

[feat] Request: let the convert_files script support encoding validation for json and jsonl files, or for arbitrary text files

About the .txt extension

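In practice, a large share of raw data, such as game text, is provided as yml or json, and without modifying the source code it is hard to make the convert_files script include these files through command-line arguments. The -i command-line input takes a directory. Ensuring that all files under that directory, recursively, share the same format is relatively easy, but ensuring that they all have the .txt extension is hard.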

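The get_file_paths function at common_utils.py:4 hard-codes the extension as .txt, and the scan_dir function at api.py:113 hard-codes the extension as .txt as well. Could these accept other extensions as a parameter?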
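
To illustrate the request, here is a minimal sketch of a recursive file collector that takes the extensions as a parameter. The name collect_file_paths and its signature are hypothetical and are not the project's actual get_file_paths or scan_dir; the default of (".txt",) is an assumption chosen to preserve the current behavior.

```python
import os
from typing import Iterable, List


def collect_file_paths(root: str, extensions: Iterable[str] = (".txt",)) -> List[str]:
    """Hypothetical helper: recursively collect files under `root`
    whose names end with one of `extensions` (case-insensitive)."""
    # str.endswith accepts a tuple of suffixes, so normalize once up front.
    wanted = tuple(ext.lower() for ext in extensions)
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(wanted):
                paths.append(os.path.join(dirpath, name))
    # Sort so the result is deterministic across platforms.
    return sorted(paths)
```

A command-line option (for example a repeatable extension flag, also hypothetical) could then pass something like (".json", ".jsonl", ".yml") through to this function, while callers that pass nothing still get only .txt files.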

About key-value files

For the corpora submitted to MNBVC we use the .jsonl suffix. Files of this type need a uniform encoding validation (a pass-through rule) or correction.

For example, JSON saved without ensure_ascii=False looks like this:

{"\u662f\u5426\u5f85\u67e5\u6587\u4ef6": false, "\u662f\u5426\u91cd\u590d\u6587\u4ef6": false, "\u6bb5\u843d\u6570": 17944, "\u53bb\u91cd\u6bb5\u843d\u6570": 0, "\u4f4e\u8d28\u91cf\u6bb5\u843d\u6570": 0}

What we would like is for the converted jsonl to contain no Unicode escapes, like this:

{"是否待查文件": false, "是否重复文件": false, "段落数": 17944, "去重段落数": 0, "低质量段落数": 0}

Could you take a look at whether this final conversion step should be done in the convert_files.py script?
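
The conversion described above can be sketched with the standard json module: parse each line and re-serialize it with ensure_ascii=False. The function names unescape_jsonl_line and unescape_jsonl_file are hypothetical and not part of charset_mnbvc, and the sketch assumes the input file has already been converted to UTF-8.

```python
import json


def unescape_jsonl_line(line: str) -> str:
    """Re-serialize one JSON Lines record so that non-ASCII characters
    are written literally instead of as \\uXXXX escapes."""
    record = json.loads(line)
    return json.dumps(record, ensure_ascii=False)


def unescape_jsonl_file(src_path: str, dst_path: str) -> None:
    """Hypothetical helper: rewrite a .jsonl file line by line.
    Assumes the source file is already valid UTF-8."""
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines rather than fail on them
            dst.write(unescape_jsonl_line(line) + "\n")
```

Key order is preserved, but whitespace is normalized to the json.dumps defaults, so the output may differ from the input in spacing even where no escapes were present.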
