Support for Chinese? (sumeval issue, closed)

chakki-works avatar chakki-works commented on May 25, 2024
Support for Chinese?

from sumeval.

Comments (12)

xiongma avatar xiongma commented on May 25, 2024

@icoxfog417 me too

icoxfog417 avatar icoxfog417 commented on May 25, 2024

I want to support Chinese! spaCy can tokenize Chinese, so we think we can support Chinese in the same way as English.

We also need the following material to test the behavior.

@micxyj , @policeme Do you have any resources that would help with implementing the above two?

xiongma avatar xiongma commented on May 25, 2024

I only have a stop-word vocabulary list. As for the regex, should it match only Chinese, or other words as well?

micxyj avatar micxyj commented on May 25, 2024

icoxfog417 avatar icoxfog417 commented on May 25, 2024

Thank you for offering the stop word list!

> For regex, do you mean that you will replace Chinese punctuation like ",!。", and so on?

Yes. We have to strip the characters that should be ignored when calculating ROUGE/BLEU.

micxyj avatar micxyj commented on May 25, 2024
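import re

# Sample text with punctuation mixed in between the Chinese words.
example = "我:》?喜。,欢!@打~···篮「」球【】"
# First class: whitespace, ASCII punctuation and quotation marks; second class: Chinese punctuation.
reg = re.compile(r'[\s+\.\!\/_,$%^*(+\“\’\”\'\"]+|[+——!,。:?、~@#¥%……&*()【】「」《》·]+')
# Remove every match so that only the words remain.
result = reg.sub('', example)
print(result)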

Would this work?
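As a point of comparison with the hand-written character classes above, the same stripping can be sketched with Unicode general categories, so that punctuation not listed in the regex is also caught. This is only an illustration and not what sumeval does; the function name `strip_punctuation` is hypothetical, and the choice to drop every punctuation (P*) and symbol (S*) character is an assumption.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop whitespace and every character whose Unicode general
    category is punctuation (P*) or symbol (S*).

    Hypothetical helper for illustration; not part of sumeval.
    """
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if ch.isspace() or category[0] in ("P", "S"):
            continue
        kept.append(ch)
    return "".join(kept)


print(strip_punctuation("我:》?喜。,欢!@打~···篮「」球【】"))
```

One trade-off of this approach is that it is broader than the regex: symbols such as "+" or "$" are removed everywhere, which may not be wanted for text that mixes in English or code.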

xiongma avatar xiongma commented on May 25, 2024

Does it support Chinese now? @icoxfog417

icoxfog417 avatar icoxfog417 commented on May 25, 2024

I'm working on it now. Please wait a little.

xiongma avatar xiongma commented on May 25, 2024

OK, thanks for your work!

icoxfog417 avatar icoxfog417 commented on May 25, 2024

I implemented Chinese support. Now we need test summaries to verify it.
Making test summaries requires careful work, because the following token counts have to be adjusted:

  • the number of tokens in the summaries.
  • the number of tokens in the references.
  • the number of matching tokens between the summaries and the references.

Here is a Japanese sample (each token is separated by a space). This will take a little time.
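To show why those three counts matter, here is a minimal sketch of plain ROUGE-N computed on space-separated tokens. It is not sumeval's implementation (no stop-word removal, stemming, or punctuation handling), the function name `rouge_n` is hypothetical, and the two Chinese sentences are made-up examples rather than actual test data.

```python
from collections import Counter


def rouge_n(summary: str, reference: str, n: int = 1):
    """Return (recall, precision, f1) of plain ROUGE-N.

    Both inputs are assumed to be pre-tokenized, with tokens
    separated by spaces. Hypothetical helper for illustration.
    """
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    summary_ngrams = ngrams(summary)
    reference_ngrams = ngrams(reference)
    # Clipped overlap: each n-gram counts at most as often as it
    # appears in both the summary and the reference.
    overlap = sum((summary_ngrams & reference_ngrams).values())

    recall = overlap / sum(reference_ngrams.values()) if reference_ngrams else 0.0
    precision = overlap / sum(summary_ngrams.values()) if summary_ngrams else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return recall, precision, f1


# Made-up example: 4 summary tokens, 4 reference tokens, 2 matches.
print(rouge_n("我 喜欢 打 篮球", "我 喜欢 踢 足球"))
```

Recall is the match count divided by the reference token count, and precision is the match count divided by the summary token count, so a test summary only yields a predictable score when all three counts are fixed in advance.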

icoxfog417 avatar icoxfog417 commented on May 25, 2024

Added a test case based on the paper "Regularizing Output Distribution of Abstractive Chinese Social Media Text Summarization for Improved Semantic Consistency".

icoxfog417 avatar icoxfog417 commented on May 25, 2024

Merged #10. Please open an issue if you run into any trouble with Chinese!
