Comments (12)
@icoxfog417 me too
from sumeval.
I want to support Chinese! spaCy can tokenize Chinese, so I think we can support it in the same manner as English.
We need the following materials to test the behavior.
- A stop word / stemming word list.
- A regex to replace symbols.
@micxyj, @policeme Do you have any resources that would help implement the above two?
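As a rough illustration of how the two materials listed above might be used together, here is a minimal sketch that applies a stop word list and a symbol regex to text that has already been tokenized. The function name `clean_tokens`, the tiny stop word set, and the punctuation pattern are hypothetical placeholders chosen for this example; they are not sumeval's actual implementation.

```python
import re

# Hypothetical, deliberately tiny stop word set; a real list would be loaded from a file.
STOP_WORDS = {"的", "了", "是"}
# Assumed pattern: a handful of ASCII and full-width Chinese punctuation marks.
SYMBOLS = re.compile(r"[,.!?,。!?、]+")


def clean_tokens(tokens):
    """Strip symbols from each token, then drop empty tokens and stop words."""
    cleaned = []
    for token in tokens:
        token = SYMBOLS.sub("", token)
        if token and token not in STOP_WORDS:
            cleaned.append(token)
    return cleaned


print(clean_tokens(["我", "喜欢", "打", "篮球", "的", "。"]))
```

Stripping symbols before the stop word check means a token such as "的。" is still recognized as a stop word once its punctuation is removed.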
from sumeval.
I only have a stop word list. As for the regex, should it cover only Chinese, or other characters as well?
from sumeval.
- For Chinese stop words, you can refer to this link: https://github.com/micxyj/extract_keywords/blob/master/stopwords.txt
- For regex, do you mean that you will replace Chinese punctuation like ",!。", etc.?
from sumeval.
Thank you for offering the stop word list!
> For regex, do you mean that you will replace Chinese punctuation like ",!。", etc?
Yes. We have to exclude characters that should be ignored when calculating ROUGE/BLEU.
from sumeval.
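```python
import re

# Sample text mixing Chinese characters with punctuation and symbols.
example = "我:》?喜。,欢!@打~···篮「」球【】"
# First alternative: whitespace and ASCII symbols; second: Chinese punctuation.
reg = re.compile(r'[\s+\.\!\/_,$%^*(+\“\’\”\'\"]+|[+——!,。:?、~@#¥%……&*()【】「」《》·]+')
# Remove every matched run of symbols.
result = reg.sub('', example)
print(result)
```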
Would this work?
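One possible alternative to maintaining an explicit character list is to filter by Unicode category, so that punctuation not listed in the regex is also covered. The sketch below is only an assumption about how this could be done; the helper name `strip_symbols` is hypothetical, and this approach also drops symbols such as `+` or `$`, which may not always be desired.

```python
import unicodedata


def strip_symbols(text):
    """Drop whitespace and characters in Unicode categories P* (punctuation),
    S* (symbols), and Z* (separators); keep everything else."""
    kept = []
    for ch in text:
        if ch.isspace():
            continue
        if unicodedata.category(ch)[0] in ("P", "S", "Z"):
            continue
        kept.append(ch)
    return "".join(kept)


print(strip_symbols("我,喜。欢!打「篮」球"))
```

Because the filter is category-based, it needs no updates when new punctuation marks appear in the input, at the cost of being less selective than a hand-written pattern.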
from sumeval.
Does it support Chinese now? @icoxfog417
from sumeval.
I'm working on it now. Please wait a little.
from sumeval.
OK, thanks for your work!
from sumeval.
I implemented Chinese support. Now we need test summaries to test it.
Making test summaries requires careful work, because the following token counts have to be adjusted.
- The number of tokens in the summaries.
- The number of tokens in the references.
- The number of matching tokens between the summaries and the references.
Here is a Japanese sample (tokens are separated by spaces). So it will take a little time.
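To show why the three counts above matter, here is a minimal sketch of how they determine the expected ROUGE-1 score for a hand-made test case. The function names and the `alpha` parameter are assumptions for illustration (the F formula follows the convention of the original ROUGE script, where `alpha=0.5` gives the harmonic mean); this is not sumeval's test code.

```python
from collections import Counter


def unigram_counts(summary, reference):
    """Count tokens in space-separated texts and their clipped overlap."""
    s = Counter(summary.split())
    r = Counter(reference.split())
    match = sum((s & r).values())  # each token counts at most min(s, r) times
    return sum(s.values()), sum(r.values()), match


def expected_rouge(n_summary, n_reference, n_match, alpha=0.5):
    """Return (recall, precision, f) implied by the three counts."""
    recall = n_match / n_reference if n_reference else 0.0
    precision = n_match / n_summary if n_summary else 0.0
    if recall == 0.0 or precision == 0.0:
        return recall, precision, 0.0
    f = (precision * recall) / ((1 - alpha) * precision + alpha * recall)
    return recall, precision, f


counts = unigram_counts("我 喜欢 打 篮球", "我 很 喜欢 打 篮球")
print(counts, expected_rouge(*counts))
```

With 4 summary tokens, 5 reference tokens, and 4 matches, the expected precision is 1.0 and the expected recall is 0.8, so a test case built this way has a score that can be checked by hand.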
from sumeval.
Added a test case based on "Regularizing Output Distribution of Abstractive Chinese Social Media Text Summarization for Improved Semantic Consistency".
from sumeval.
Merged #10. Please open an issue if you have any trouble with Chinese!
from sumeval.