phuonglh / vn.vitk

A Vietnamese Text Processing Toolkit

License: GNU General Public License v3.0

HTML 0.03% Java 99.97%

vn.vitk's Introduction

Vitk -- A Vietnamese Text Processing Toolkit

NOTE: This repository is now obsolete. Interested programmers should consider using the new repository vlp (github.com/phuonglh/vlp). We have preferred Scala over Java since 2016.

NOTE: Since early 2018, a new, updated and lightweight toolkit that does not use Apache Spark has been available as vnTokenizer and vnTagger, together with an online demo website.

This is the third release of a Vietnamese text processing toolkit called "Vitk", developed by Phuong LE-HONG at the College of Science, Vietnam National University in Hanoi.

Several toolkits for Vietnamese text processing have already been published. However, most of them do not scale readily to large data sets. This toolkit aims to process big text data. For this reason, it uses Apache Spark, a fast and general engine for large-scale data processing, as its core platform. This makes Vitk a fast cluster computing toolkit.

If you don't want to use Apache Spark, you should download and use a standalone Vietnamese tokenizer or tagger from their websites: vnTokenizer 5.1 and vnTagger. Only one JAR file is needed to run each program.

Despite its name, this toolkit supports processing of various natural languages, provided that suitable underlying models or linguistic resources are available for those languages. The toolkit is packaged with models and resources for processing Vietnamese. Users can build models for other languages using the underlying tools.

Some examples:

The word segmentation tool of Vitk can tokenize a text of two million Vietnamese syllables in 20 seconds on a cluster of three computers (24 cores, 24 GB RAM), giving an accuracy of about 97%.
The part-of-speech tagger of Vitk can tag about 1,105,000 tokens per second, on a single machine, giving an accuracy of about 95% on the Vietnamese treebank.
The dependency parser of Vitk parses 12,543 sentences (204,586 tokens) of the English universal dependency treebank (English UDT) in less than 20 seconds, giving an accuracy of 68.28% (UAS) or 66.30% (LAS).

Tools

Currently, Vitk consists of three fundamental tools for text processing:

Word segmentation
Part-of-speech tagging
Dependency parsing

The word segmentation tool is specific to the Vietnamese language. The other tools are general and can be trained for any language. We are working to develop and integrate more fundamental tools into Vitk, such as named entity recognition, constituency parsing and opinion mining.

Setup and Compilation

Prerequisites: a Java Development Kit (JDK), version 7.0 or later, and Apache Maven, version 3.0 or later. Make sure that the following two commands work in your shell (console window):

java -version

mvn -version

Download a prebuilt version of Apache Spark. Vitk uses Spark version 1.6.x. Unpack the compressed file to a directory, for example ~/spark, where ~ is your home directory.

Download Vitk, either as a binary archive or as source code. The project repository is phuonglh/vn.vitk on GitHub. The source code version is preferable. It is easy to compile and package Vitk: go to the top-level directory of Vitk and run the following command in a shell window:

mvn compile package

Apache Maven will automatically resolve and download dependency libraries required by Vitk. Once the process finishes, you should have a binary jar file vn.vitk-3.0.jar in the sub-directory target.

Running

Data Files

Data files used by Vitk are specified in sub-directories of the directory dat, corresponding to its integrated tools.

The data used by word segmentation are in the sub-directory dat/tok.
The data used by part-of-speech tagging are in the sub-directory dat/tag.
The data used by dependency parsing are in the sub-directory dat/dep.

These folders can contain data specific to a natural language in use. Each language is specified further by a sub-directory whose name is an abbreviation of the language name, for example vi for Vietnamese, en for English, fr for French, etc.

Vitk can run as an application in stand-alone cluster mode or on a real cluster. If it runs on a cluster, all machines in the cluster must be able to access the same data files, which are normally located in a shared directory readable by all the machines.

If you use a Unix-like operating system (Unix/Linux/MacOS), it is easy to share or "export" a directory via a network file system (NFS). By default, Vitk searches for data files in the directory /export/dat/. Therefore, you need to copy the sub-directories dat/* into that directory, so that you have some folders as follows:

/export/dat/tok
/export/dat/tok/whitespace.model
/export/dat/tag/vi/cmm
/export/dat/dep/vi/mlp

If you run Vitk in stand-alone cluster mode, it is sufficient to create the data folders specified above on your single machine. The NFS setup can be ignored.
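Before submitting a job, it can help to confirm that the data layout above is in place. The following is a minimal sketch, not part of Vitk: the class name DataLayoutCheck is hypothetical, and the list of expected paths is copied from the example layout above rather than from the toolkit's source.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Vitk: reports which expected data paths are missing. */
class DataLayoutCheck {
    // Relative paths taken from the example layout in this README.
    static final List<String> EXPECTED = Arrays.asList(
            "tok", "tok/whitespace.model", "tag/vi/cmm", "dep/vi/mlp");

    /** Returns the expected relative paths that do not exist under dataRoot. */
    static List<String> missing(Path dataRoot) {
        List<String> absent = new ArrayList<>();
        for (String relative : EXPECTED) {
            if (!Files.exists(dataRoot.resolve(relative))) {
                absent.add(relative);
            }
        }
        return absent;
    }

    public static void main(String[] args) {
        // Defaults to the directory that Vitk searches by default.
        Path root = Paths.get(args.length > 0 ? args[0] : "/export/dat");
        List<String> absent = missing(root);
        if (absent.isEmpty()) {
            System.out.println("All expected data paths found under " + root);
        } else {
            System.out.println("Missing under " + root + ": " + absent);
        }
    }
}
```

On a cluster, the same check would need to pass on every machine, since all of them read the shared directory.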

Usage

Vitk is an Apache Spark application. You run it by submitting the main JAR file vn.vitk-3.0.jar to Apache Spark. The main class of the toolkit is vn.vitk.Vitk, which selects the desired tool according to the arguments provided by the user.

The general arguments of Vitk are as follows:

-m <master-url>: the master URL for the cluster, for example spark://192.168.1.1:7077. If you do not have a cluster, you can ignore this parameter. In this case, Vitk uses the stand-alone cluster mode, which is defined by local[*], that is, it uses all the CPU cores of your single machine for parallel processing.
-t <tool>: the tool to run, where tool is an abbreviation of the tool: tok for word segmentation (or tokenization); tag for part-of-speech tagging, dep for dependency parsing. If this argument is not specified, the default tok tool is used.
-l <language>: the natural language to process, where language is an abbreviation of language name which is either vi (Vietnamese) or en (English). If this argument is not specified, the default language is Vietnamese.
-v: this parameter takes no argument. If it is used, Vitk runs in verbose mode, in which some intermediate information is printed during processing. This is useful for debugging.

Note that if you are processing very large texts, you should consider setting appropriate options of the spark-submit command for better performance, in particular --executor-memory and --driver-memory. See the Apache Spark documentation on submitting applications for more details.
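The ordering of options matters when calling spark-submit: options such as --class, --driver-memory and --executor-memory go before the JAR file, while the Vitk arguments described above go after it. The sketch below assembles such a command line as a list. It is an illustration only: the class SubmitCommand is hypothetical, and the memory sizes 2g and 4g are placeholder values, not recommendations from the Vitk author.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Vitk: assembles a spark-submit command line. */
class SubmitCommand {
    static List<String> build(String sparkSubmit, String jar,
                              String driverMemory, String executorMemory,
                              String... vitkArgs) {
        List<String> command = new ArrayList<>();
        command.add(sparkSubmit);
        // spark-submit options must come before the application JAR.
        command.addAll(Arrays.asList("--class", "vn.vitk.Vitk"));
        command.addAll(Arrays.asList("--driver-memory", driverMemory));
        command.addAll(Arrays.asList("--executor-memory", executorMemory));
        command.add(jar);
        // Everything after the JAR is passed to Vitk itself.
        command.addAll(Arrays.asList(vitkArgs));
        return command;
    }

    public static void main(String[] args) {
        String home = System.getProperty("user.home");
        List<String> command = build(
                home + "/spark/bin/spark-submit",
                "target/vn.vitk-3.0.jar",
                "2g", "4g",               // placeholder sizes; tune for your data
                "-t", "tag", "-l", "vi"); // general Vitk arguments from the list above
        System.out.println(command);
        // To launch it: new ProcessBuilder(command).inheritIO().start();
    }
}
```

Tool-specific arguments, such as input and output files, would be appended after the general ones in the same way.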

In addition to the general arguments above, each tool of Vitk requires its own arguments. The usage of each tool is described on its corresponding page.

You can also import the source code of Vitk to your favorite IDE (Eclipse, Netbeans, etc), compile and run from source, for example, launch the class vn.vitk.tok.Tokenizer for word segmentation, providing appropriate arguments as described above.

Documentation

The algorithms used by the tools of Vitk are described in related scientific publications. In addition, some of the main methods implemented in Vitk have been, and will be, described in a more accessible way in blog posts, for example the word segmentation method.

Contribution Guidelines

Writing tests
Code review
Contributions

Contact

Any bug reports, suggestions and collaborations are welcome. I am reachable at:

LE-HONG Phuong, http://mim.hus.edu.vn/lhp/
College of Science, Vietnam National University in Hanoi

vn.vitk's Issues

Bug: the program runs forever without throwing an exception

Hello Professor,
I am using the library you wrote for word segmentation. When I segment a certain text, the program keeps running forever, with no exception and no log output on the screen. By debugging, I found that the line of text causing this behaviour is the following string: "Sinh ngày…………… tháng…………… năm ....................................................................".
I hope you can fix this bug soon!
Thank you!

Support DongDu Pointwise Implementation

DongDu mentions that it can achieve 98% with its pointwise approach to Vietnamese word segmentation. However, DongDu has not been maintained for 2 years (no answers, no pull requests accepted, ...). Also, DongDu's implementation is not good enough (there are many problems with the output when I run the predictor tool on a 100MB corpus).

Should we have an additional implementation of the DongDu approach besides your hybrid approach?

Thanks and best regards,

Improve Word Segmentation

In the sentence Với Thúy Kiều, đây lại là một lần cô dang dở.
Với Thúy Kiều is tokenized as a single word, Với_Thúy_Kiều. The same happens with Và, Bởi, ...

What should we do to improve the word segmentation in these cases?

Error when loading vn.vitk-3.0.jar

When I run the following command
~/spark/bin/spark-submit --class=vn.vitk.Vitk ~/vn.vitk/target/vn.vitk-3.0.jar -i -o
I get the error java.lang.ClassNotFoundException: vn.vitk.Vitk.

Where should I create the /export/dat/... folder?

Hi! I have the same problem as issue #13 (Add option to run on single machine).
You answered "you need to do is to create the data folder /export/dat/ on that machine", but where should I create the folder: on the C: drive or somewhere else? Can you give me more details?

Question about vntik

Hello Professor, I tried the latest version and followed the instructions, then imported the project into an IDE (Eclipse and NetBeans), but when I run it the program still reports an error. I hope you can help me. Thank you.

Hello Professor. When I follow "Cách chạy phân đoạn từ" ("How to run word segmentation"), the following error appears:

Error: Cannot load main class from JAR file:/home/cuong/vitk/target/vn.vitk-3.0.jar
Run with --help for usage help or --verbose for debug output
Below are all the steps I performed:
When I open the terminal, it already shows the line "bash: /usr/libexec/java_home: No such file or directory". I do not know whether this affects the subsequent steps.
bash: /usr/libexec/java_home: No such file or directory
cuong@ubuntu:~/spark$ ./bin/spark-submit /vitk/target/vn.vitk-3.0.jar -i Q1-06-D.txt -o ma1.txt
Error: Cannot load main class from JAR file:/home/cuong/vitk/target/vn.vitk-3.0.jar
Run with --help for usage help or --verbose for debug output
cuong@ubuntu:/spark$
Please point out the error and show me how to fix it. I hope to hear from you soon. Thank you.

Infinite loop when using word segmentation

Hello Professor!
Thank you for this useful Vietnamese processing tool. Recently I have been using vntokenizer for a problem at my company. However, I have noticed that with certain inputs, the tokenizer falls into a loop that never terminates.
Here is a sample text that triggers the problem:
Nguyên lý thẩm định giáNguyên lý thẩm định giáTài liệu Luận văn Sách Download /5 Báo cáo Báo cáo tài liệu này Bình luận Nguyên lý thẩm định giá Upload: KhoaTranAnh.dokovn | Ngày: 13/12/2012 | Lượt xem: 358 | Tải về: 3 Chuyên mục: Kinh tế » Kế Toán - Kiểm Toán Nguyên lý thẩm định giá Đánh dấu chủ đề Tag: biệt thự hoàng quân biệt thự nguyên lý thẩm định giá lô đất bài tập tên môn học mã môn học ngôi nhà cách nhau thẩm định Nguyên tắc thẩm định giá Nguyên lí thẩm định giá Các nguyên tắc thẩm định giá Nguyên lý cơ bản của thẩm định Nguyên tắc và phương pháp thẩm định giá Tiêu chuẩn thẩm định giá Quy trình thẩm định giá Báo cáo thẩm định giá Cơ sở thẩm định giá Thẩm định giá bất động sản Bài tập về thẩm định giá Các phương pháp thẩm định giá Pháp luật về thẩm định giá Tài liệu về thẩm định giá Giáo trình phương pháp thẩm định giá Các phương pháp thẩm định giá bất động sản Thẩm định và đánh giá rủi ro tín dụng Giáo trình luật kinh tế trong thẩm định giá Hệ thống tiêu chuẩn thẩm định giá việt nam Phân tích và thẩm định giá chứng khoán

I hope you can fix this bug in the tokenizer soon. Thank you.
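Until the cause of such a hang is fixed, a caller can at least keep one bad input from blocking a whole job by running each call under a time limit. The following is a generic sketch using java.util.concurrent and is not part of Vitk or vnTokenizer: the class TimeLimited and its method are hypothetical names, and the tokenizer call itself is left to the caller because its API is not shown in this thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical guard, not part of Vitk: bounds the running time of a single call. */
class TimeLimited {
    /** Runs task and returns its result, or fallback if it exceeds timeoutMillis. */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "time-limited-call");
                t.setDaemon(true); // a stuck worker must not keep the JVM alive
                return t;
            }
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; a loop that ignores interrupts keeps spinning
            // in its daemon thread, but the caller is no longer blocked.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller would wrap each tokenization call in a Callable and log the inputs for which the fallback was returned, which also produces a list of reproducible test cases for the bug report.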

Constant variables should be uppercase

Several classes declare constants with lowercase names. Constant names should be uppercase, following the Code Conventions.

Example from vn.vitk/src/main/java/vn/vitk/util/TokenNormalizer.java

static Pattern number = Pattern.compile("^([\\+\\-\\.,]?([0-9]*)?[0-9]+([\\.,]\\d+)*[%°]?)$"); 
static Pattern punctuation = Pattern.compile("^([\\?!\\.:;,\\-\\+\\*\"'\\(\\)\\[\\]\\{\\}]+)$");
static Pattern email = Pattern.compile("^(\\w[\\-\\._\\w]*\\w@\\w[\\-\\._\\w]*\\w\\.\\w{2,3})$");
static Pattern date = Pattern.compile("^(\\d+[\\-\\./]\\d+([\\-\\./]\\d+)?)$|^(năm_)\\d+$");
static Pattern code = Pattern.compile("^(\\d+[\\-\\._/]?[\\w\\W\\-\\._/]+)$|^([\\w\\W\\-\\._/]+\\d+[\\-\\._/\\d]*[\\w\\W]*)$");
static Pattern website = Pattern.compile("^(\\w+\\.\\w+)(/\\w*)*$");
static Pattern xmlTag = Pattern.compile("^</?\\w+>$");
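The requested change could look like the sketch below, which renames a few of the fields above and adds the final modifier. This is an illustration, not the actual patch: the class name TokenPatterns is hypothetical, only four of the seven patterns are shown, and the regular expressions are copied unchanged from the listing above.

```java
import java.util.regex.Pattern;

/** Hypothetical holder class illustrating uppercase constant names. */
class TokenPatterns {
    // Constants are static final and named in UPPER_SNAKE_CASE.
    static final Pattern PUNCTUATION =
            Pattern.compile("^([\\?!\\.:;,\\-\\+\\*\"'\\(\\)\\[\\]\\{\\}]+)$");
    static final Pattern EMAIL =
            Pattern.compile("^(\\w[\\-\\._\\w]*\\w@\\w[\\-\\._\\w]*\\w\\.\\w{2,3})$");
    static final Pattern WEBSITE =
            Pattern.compile("^(\\w+\\.\\w+)(/\\w*)*$");
    static final Pattern XML_TAG =
            Pattern.compile("^</?\\w+>$");

    public static void main(String[] args) {
        System.out.println(XML_TAG.matcher("</p>").matches());
        System.out.println(EMAIL.matcher("john.doe@example.com").matches());
    }
}
```

The remaining fields (number, date, code) would be renamed in the same way, and every reference to the old names updated accordingly.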

Build resources into the jar file

Hi Professor,
Having to copy the resources to every machine when running in yarn-cluster mode makes deployment quite difficult (the person running the job has no permissions on the slave nodes, adding a new node requires copying the data, ...).
The project could therefore be changed to build the resource files into the jar, as follows.
In pom.xml:

<build>
		<resources>
			<resource>
				<directory>models</directory>
				<targetPath>models</targetPath>
			</resource>			
		</resources>
</build>

With this, when the jar file is built (open it with an archive tool to check), you will see the directory dictionary inside the jar file.
To read a file from the resources, instead of using FileInputStream, obtain the stream as follows (Tenclass stands for the name of your class):
Tenclass.class.getResourceAsStream("/dictionary/test.txt")
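The resource-loading idea proposed in this issue can be sketched as a small helper. This is a minimal illustration, not code from Vitk: the class ResourceLines and its read method are hypothetical names, the use of UTF-8 is an assumption, and /dictionary/test.txt is simply the example path quoted in the issue.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Vitk: reads a text resource packaged in the jar. */
class ResourceLines {
    static List<String> read(Class<?> anchor, String resourcePath) {
        // getResourceAsStream returns null when the resource is not on the classpath.
        InputStream in = anchor.getResourceAsStream(resourcePath);
        if (in == null) {
            throw new IllegalArgumentException("Resource not found on classpath: " + resourcePath);
        }
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new RuntimeException("Failed to read " + resourcePath, e);
        }
        return lines;
    }
}
```

A call such as ResourceLines.read(ResourceLines.class, "/dictionary/test.txt") would then work the same way on every node, because the data travels inside the jar. Note that APIs which expect a file system path, such as Spark's textFile, cannot read from inside a jar directly, so this approach suits resources that are loaded through Java streams.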

Error when running tagging on Hadoop

Hi,

vn.vitk runs fine on my PC, but after I moved it to a server running Hadoop I got the error below. I would really appreciate your help.

16/10/24 09:28:28 WARN org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
16/10/24 09:28:28 INFO org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 1 records.
16/10/24 09:28:28 INFO org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
16/10/24 09:28:28 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
16/10/24 09:28:28 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.gz]
16/10/24 09:28:28 INFO org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 26 ms. row count = 1
16/10/24 09:28:28 INFO org.apache.hadoop.mapred.FileInputFormat: Total input paths to process : 1
16/10/24 09:28:28 INFO org.apache.hadoop.mapred.FileInputFormat: Total input paths to process : 1
16/10/24 09:28:29 INFO org.apache.parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
16/10/24 09:28:29 WARN org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
16/10/24 09:28:29 INFO org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 1 records.
16/10/24 09:28:29 INFO org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
16/10/24 09:28:29 INFO org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 2 ms. row count = 1
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.DataFrameReader.load(Ljava/lang/String;)Lorg/apache/spark/sql/DataFrame;
at vn.vitk.tag.CMMModel$CMMModelReader.load(CMMModel.java:357)
at vn.vitk.tag.CMMModel$CMMModelReader.load(CMMModel.java:350)
at vn.vitk.tag.CMMModel.load(CMMModel.java:339)
at vn.vitk.tag.Tagger.load(Tagger.java:165)
at vn.vitk.Vitk.main(Vitk.java:199)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Vietnamese without diacritics

Hello Professor. I am a computer science student and I am interested in vn-nlp.

May I ask what the output will be if the input is Vietnamese without diacritics, for example "sinh vien vi pham quy che thi cu"?

Part-of-Speech Tagging not working

I've tried this command ./bin/spark-submit ~/vitk/target/vn.vitk-3.0.jar -t tag -a tag -i <input-file> -o <output-file> but it does not generate output. This is the only error in the log output:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.DataFrameReader.load(Ljava/lang/String;)Lorg/apache/spark/sql/DataFrame; at vn.vitk.tag.CMMModel$CMMModelReader.load(CMMModel.java:357) at vn.vitk.tag.CMMModel$CMMModelReader.load(CMMModel.java:350) at vn.vitk.tag.CMMModel.load(CMMModel.java:339) at vn.vitk.tag.Tagger.load(Tagger.java:165) at vn.vitk.Vitk.main(Vitk.java:190) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The input file is the output of segmentation tool.

ĐOÀN 1 : ( 16 - 20 khách đi từ 28/7/2017 – 30/7/2017 )
Ngày 1 :
7h – 7h15 Đón xe đi đến Hạ_Long ( Vui long cho điểm đón cụ_thể )
11h30 : Đến_Hạ_Long , lên tàu và nhận phòng . Sau đó ăn trưa trên tàu
Chiều thăm quan Vịnh_Hạ_Long , Hang_Sửng_sốt ( Tuỳ chuong trình của từng tàu khác nhau )
Ăn tối , sau đó tự_do vui_chơi trên tàu . Tham_gia các chương_trình trên tàu như câu mực
Ngủ đêm trên tàu 4 sao - Flamingo_Cruise hoặc tương_đương
Ngày 2 :
6h – 6h30 Sáng sẽ có hoạt_động tập Tai_Chi trên bong tàu
Ăn sáng , sau đó sẽ đi thăm đảo Titop và tắm biển

Thank you very much.

Add option to run on single machine

Can you add an argument for running on a single machine? Currently the toolkit requires its data to be in /export/dat/*,
so I have to change every "/export" in the code to this repo's path.
An argument or a configurable data path would be much easier.
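One way the requested option could work is to resolve the data directory from a JVM system property or an environment variable and fall back to /export/dat only when neither is set. The sketch below is hypothetical throughout: Vitk does not define the property vitk.data.dir or the variable VITK_DATA_DIR, and the class DataDir is an invented name used only for illustration.

```java
/** Hypothetical resolver, not part of Vitk: picks the data directory to use. */
class DataDir {
    // The directory that Vitk searches by default, according to the README.
    static final String DEFAULT = "/export/dat";

    /** Prefers the property value, then the environment value, then the default. */
    static String resolve(String propertyValue, String envValue) {
        if (propertyValue != null && !propertyValue.isEmpty()) {
            return propertyValue;
        }
        if (envValue != null && !envValue.isEmpty()) {
            return envValue;
        }
        return DEFAULT;
    }

    /** Reads the (hypothetical) vitk.data.dir property and VITK_DATA_DIR variable. */
    static String resolve() {
        return resolve(System.getProperty("vitk.data.dir"), System.getenv("VITK_DATA_DIR"));
    }

    public static void main(String[] args) {
        System.out.println("Data directory: " + resolve());
    }
}
```

With such a resolver, a single-machine or Windows user could point the toolkit at any folder, for example by passing -Dvitk.data.dir=C:/vitk/dat to the JVM, without editing the source.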

Some VN words were not parsed correctly

Dear Mr. PhuongLH,

Thank you for sharing this great resource for Vietnamese language NLP.
I tested running the VITK with the test data below:

Quan điểm của Bộ Công Thương là nếu để làm thủy điện thì đây là dự án nhỏ nhằm tận dụng tài nguyên nước và được Chính phủ cho phép làm thì Bộ Công Thương ủng hộ, không để lãng phí nguồn nước
Trước đó, ngày 5/5, Vụ trưởng Vụ Giám sát và Thẩm định đầu tư cũng khẳng định, dự án giao thông thủy xuyên Á trên sông Hồng kết hợp thủy điện mới ở mức sơ khai, ý tưởng đề xuất.
Ông Nguyễn Xuân Tự
Nguyễn Xuân Tự
Theo đề xuất của chủ đầu tư, mục tiêu của dự án là sẽ mở ra một tuyến vận tải thông suốt trên sông Hồng.

The results were as follows:

Quan_điểm của Bộ Công_Thương la ̀ nếu để làm thuỷ_điện thì đây là dự_án nhỏ nhă ̀ m tận_dụng tài_nguyên nước và được Chính_phủ cho phép làm thì Bộ Công_Thương ủng_hộ , không để lãng_phí nguồn nước
Trước đó , nga ̀ y 5/5 , Vụ_trưởng Vụ_Giám_sát và Thẩm_định đầu_tư cu ̃ ng khă ̉ ng đi ̣ nh , dự_án giao_thông thuỷ xuyên Á trên sông Hồng kết_hợp thuỷ_điện mới ở mức sơ_khai , ý_tưởng đề_xuất .
Ông_Nguyễn_Xuân_Tự
Nguyễn_Xuân_Tự
Theo đê ̀ xuâ ́ t cu ̉ a chu ̉ đâ ̀ u tư , mu ̣ c tiêu của dự_án là sẽ mở ra một tuyến vận_tải thông_suốt trên sông Hồng .

Most of them were parsed impressively; however, some were parsed in a strange manner, e.g.:

- ngày 5/5 --> nga ̀ y 5/5
- khẳng định --> khă ̉ ng đi ̣ nh
- đề xuất --> đê ̀ xuâ ́ t
- mục tiêu --> mu ̣ c tiêu

Kindly advise whether I have misconfigured anything, or whether I need to perform any further steps before running the toolkit, in order to improve the output of the program.

Thanks again for your kind support.

Regards,
Pomy66
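A plausible explanation, not confirmed in this thread, is that the affected words were typed in decomposed Unicode form, where a base letter is followed by separate combining marks, so the tokenizer treats the marks as separate symbols. If that is the case, normalizing the input to NFC before tokenization should help. The sketch below uses the standard java.text.Normalizer class; the wrapper class UnicodeInput is a hypothetical name.

```java
import java.text.Normalizer;

/** Hypothetical pre-processing step, not part of Vitk: composes combining marks. */
class UnicodeInput {
    /** Converts text to Unicode Normalization Form C (precomposed characters). */
    static String toNfc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'a' followed by U+0300 COMBINING GRAVE ACCENT, as in a decomposed "ngày".
        String decomposed = "nga\u0300y";
        String composed = toNfc(decomposed);
        // The decomposed form has 5 chars, the composed form has 4.
        System.out.println(decomposed.length() + " -> " + composed.length());
    }
}
```

Applying toNfc to each input line before passing it to the word segmentation tool would be a cheap way to test this hypothesis on the sentences above.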

Question about the shortest path in the algorithm

Professor, I have a small question for you.

From the paper "a hybrid Approach to Word Segmentation of Vietnamese Texts" and your blog on FPT TechInsight, my understanding is that the shortest path on the constructed graph is the path with the fewest arcs, that is, every arc has weight 1. But when reading the code in vitk-tok-5.2.jar, I see that the edges are assigned the weight 1/(the distance between the indices of the two vertices). Does this assignment have any particular meaning?

I have one more question: according to the paper, the path with the highest probability is selected, but it seems that in vitk-tok-5.2.jar the last path found is the one selected:

List<LinkedList<Integer>> paths = graph.shortestPaths();
if (paths.size() > 0) {
    LinkedList<Integer> selectedPath = (LinkedList)paths.get(paths.size() - 1);
    ...

I do not yet understand why it is chosen this way.

I look forward to your guidance. Thank you.
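To see what the 1/(distance) weighting described in this question does in practice, it helps to work a small case. The sketch below is written from the description in this issue only, not from the vitk-tok source: the class SegmentationGraph and its method are hypothetical. Vertices are syllable boundaries 0..n, and an edge from i to j stands for a candidate word that spans syllables i to j.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration of shortest paths with edge weight 1/(to - from). */
class SegmentationGraph {
    /** edges[k] = {from, to} with from < to; returns the lightest path from 0 to n. */
    static List<Integer> shortestPath(int n, int[][] edges) {
        double[] dist = new double[n + 1];
        int[] prev = new int[n + 1];
        Arrays.fill(dist, Double.POSITIVE_INFINITY);
        Arrays.fill(prev, -1);
        dist[0] = 0.0;
        // Edges only go forward, so visiting vertices in increasing order
        // guarantees dist[u] is final before its outgoing edges are relaxed.
        for (int u = 0; u <= n; u++) {
            if (Double.isInfinite(dist[u])) continue;
            for (int[] e : edges) {
                if (e[0] != u) continue;
                double weight = 1.0 / (e[1] - e[0]);
                if (dist[u] + weight < dist[e[1]]) {
                    dist[e[1]] = dist[u] + weight;
                    prev[e[1]] = u;
                }
            }
        }
        List<Integer> path = new ArrayList<>();
        if (Double.isInfinite(dist[n])) return path; // vertex n is unreachable
        for (int v = n; v != -1; v = prev[v]) path.add(v);
        Collections.reverse(path);
        return path;
    }

    public static void main(String[] args) {
        // Two candidate segmentations of four syllables, both with two words.
        int[][] edges = {{0, 1}, {1, 4}, {0, 2}, {2, 4}};
        System.out.println(shortestPath(4, edges));
    }
}
```

In this example both paths have two arcs, so unit weights cannot separate them, whereas the 1/(distance) weights give 1 + 1/3 for the path through vertex 1 and 1/2 + 1/2 = 1 for the path through vertex 2. This is only an observation about the arithmetic: the weighting can break ties between paths with the same number of arcs. Whether that is the author's intent is not stated here.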

Project failed to run

I have encountered a problem when trying to run the project and couldn't fix it. This is my full log:

------------------------------------------------------------------------
Building A Vietnamese Processing Toolkit 3.0
------------------------------------------------------------------------

--- exec-maven-plugin:1.2.1:exec (default-cli) @ vn.vitk ---
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/07/21 23:58:26 INFO SparkContext: Running Spark version 1.6.1
17/07/21 23:58:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/21 23:58:28 INFO SecurityManager: Changing view acls to: VinhTL
17/07/21 23:58:28 INFO SecurityManager: Changing modify acls to: VinhTL
17/07/21 23:58:28 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(VinhTL); users with modify permissions: Set(VinhTL)
17/07/21 23:58:30 INFO Utils: Successfully started service 'sparkDriver' on port 4229.
17/07/21 23:58:31 INFO Slf4jLogger: Slf4jLogger started
17/07/21 23:58:31 INFO Remoting: Starting remoting
17/07/21 23:58:31 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:4243]
17/07/21 23:58:31 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 4243.
17/07/21 23:58:31 INFO SparkEnv: Registering MapOutputTracker
17/07/21 23:58:31 INFO SparkEnv: Registering BlockManagerMaster
17/07/21 23:58:31 INFO DiskBlockManager: Created local directory at C:\Users\VinhTL\AppData\Local\Temp\blockmgr-9fd6d290-d9c6-46bd-90c5-3a29fb5404b2
17/07/21 23:58:32 INFO MemoryStore: MemoryStore started with capacity 1117.9 MB
17/07/21 23:58:32 INFO SparkEnv: Registering OutputCommitCoordinator
17/07/21 23:58:32 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/07/21 23:58:32 INFO SparkUI: Started SparkUI at http://192.168.188.2:4040
17/07/21 23:58:32 INFO Executor: Starting executor ID driver on host localhost
17/07/21 23:58:33 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 4262.
17/07/21 23:58:33 INFO NettyBlockTransferService: Server created on 4262
17/07/21 23:58:33 INFO BlockManagerMaster: Trying to register BlockManager
17/07/21 23:58:33 INFO BlockManagerMasterEndpoint: Registering block manager localhost:4262 with 1117.9 MB RAM, BlockManagerId(driver, localhost, 4262)
17/07/21 23:58:33 INFO BlockManagerMaster: Registered BlockManager
17/07/21 23:58:36 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 107.7 KB, free 107.7 KB)
17/07/21 23:58:36 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 9.8 KB, free 117.5 KB)
17/07/21 23:58:36 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:4262 (size: 9.8 KB, free: 1117.9 MB)
17/07/21 23:58:36 INFO SparkContext: Created broadcast 0 from textFile at Tokenizer.java:83
17/07/21 23:58:38 WARN : Your hostname, VinhTL-PC resolves to a loopback/non-reachable address: fe80:0:0:0:88f7:b80f:643:8365%eth21, but we couldn't find any external IP address!
17/07/21 23:58:43 INFO FileInputFormat: Total input paths to process : 1
17/07/21 23:58:43 INFO SparkContext: Starting job: collect at Tokenizer.java:83
17/07/21 23:58:43 INFO DAGScheduler: Got job 0 (collect at Tokenizer.java:83) with 2 output partitions
17/07/21 23:58:43 INFO DAGScheduler: Final stage: ResultStage 0 (collect at Tokenizer.java:83)
17/07/21 23:58:43 INFO DAGScheduler: Parents of final stage: List()
17/07/21 23:58:43 INFO DAGScheduler: Missing parents: List()
17/07/21 23:58:43 INFO DAGScheduler: Submitting ResultStage 0 (export/dat/tok/regexp.txt MapPartitionsRDD[1] at textFile at Tokenizer.java:83), which has no missing parents
17/07/21 23:58:43 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 120.6 KB)
17/07/21 23:58:43 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1811.0 B, free 122.3 KB)
17/07/21 23:58:43 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:4262 (size: 1811.0 B, free: 1117.9 MB)
17/07/21 23:58:43 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
17/07/21 23:58:43 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (export/dat/tok/regexp.txt MapPartitionsRDD[1] at textFile at Tokenizer.java:83)
17/07/21 23:58:43 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
17/07/21 23:58:43 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2162 bytes)
17/07/21 23:58:43 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2162 bytes)
17/07/21 23:58:43 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
17/07/21 23:58:43 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/07/21 23:58:43 INFO HadoopRDD: Input split: file:/F:/GitHub Reposities/vn.vitk/export/dat/tok/regexp.txt:0+942
17/07/21 23:58:43 INFO HadoopRDD: Input split: file:/F:/GitHub Reposities/vn.vitk/export/dat/tok/regexp.txt:942+942
17/07/21 23:58:43 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/07/21 23:58:43 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/07/21 23:58:43 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/07/21 23:58:43 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/07/21 23:58:43 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/07/21 23:58:43 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/07/21 23:58:43 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2960 bytes result sent to driver
17/07/21 23:58:43 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 3058 bytes result sent to driver
17/07/21 23:58:43 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 374 ms on localhost (1/2)
17/07/21 23:58:43 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 321 ms on localhost (2/2)
17/07/21 23:58:43 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
17/07/21 23:58:43 INFO DAGScheduler: ResultStage 0 (collect at Tokenizer.java:83) finished in 0,434 s
17/07/21 23:58:43 INFO DAGScheduler: Job 0 finished: collect at Tokenizer.java:83, took 0,651888 s
17/07/21 23:58:44 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:4262 in memory (size: 1811.0 B, free: 1117.9 MB)
17/07/21 23:58:44 INFO ContextCleaner: Cleaned accumulator 1
17/07/21 23:58:44 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:4262 in memory (size: 9.8 KB, free: 1117.9 MB)
Either an input file or an URL must be provided!
17/07/21 23:58:49 INFO SparkContext: Invoking stop() from shutdown hook
17/07/21 23:58:49 INFO SparkUI: Stopped Spark web UI at http://192.168.188.2:4040
17/07/21 23:58:49 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/07/21 23:58:49 INFO MemoryStore: MemoryStore cleared
17/07/21 23:58:49 INFO BlockManager: BlockManager stopped
17/07/21 23:58:49 INFO BlockManagerMaster: BlockManagerMaster stopped
17/07/21 23:58:49 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/07/21 23:58:49 INFO SparkContext: Successfully stopped SparkContext
17/07/21 23:58:49 INFO ShutdownHookManager: Shutdown hook called
17/07/21 23:58:49 INFO ShutdownHookManager: Deleting directory C:\Users\VinhTL\AppData\Local\Temp\spark-6b82dd75-7fc9-44d6-8add-f2596073ec48
------------------------------------------------------------------------
BUILD FAILURE
------------------------------------------------------------------------
Total time: 31.558s
Finished at: Fri Jul 21 23:58:50 ICT 2017
Final Memory: 12M/220M
------------------------------------------------------------------------
Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec (default-cli) on project vn.vitk: Command execution failed. Process exited with an error: 1 (Exit value: 1) -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec (default-cli) on project vn.vitk: Command execution failed.
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
	at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
	at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
	at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
	at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
	at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
	at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
	at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
	at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
	at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
	at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
Caused by: org.apache.maven.plugin.MojoExecutionException: Command execution failed.
	at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:362)
	at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:101)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
	... 19 more
Caused by: org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
	at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:377)
	at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:160)
	at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:610)
	at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:352)
	... 21 more

Error when running tagging

Hi, I tried running your library on a Spark cloud with the following command:

spark-submit vn.vitk/target/vn.vitk-3.0.jar -t tag -i /regang/tag/test1.txt -o /regang/tag/test1.tag

Test data file:
test1.txt

and this error appeared:
16/08/09 04:33:03 INFO spark.SparkContext: Created broadcast 0 from textFile at ReadWrite.scala:285
16/08/09 04:33:04 INFO mapred.FileInputFormat: Total input paths to process : 0
Exception in thread "main" java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1344)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.first(RDD.scala:1341)
at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:285)
at org.apache.spark.ml.util.DefaultParamsReader.loadMetadata(ReadWrite.scala)
at vn.vitk.tag.CMMModel$CMMModelReader.load(CMMModel.java:353)
at vn.vitk.tag.CMMModel$CMMModelReader.load(CMMModel.java:350)
at vn.vitk.tag.CMMModel.load(CMMModel.java:339)
at vn.vitk.tag.Tagger.load(Tagger.java:165)
at vn.vitk.Vitk.main(Vitk.java:190)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/08/09 04:33:04 INFO spark.SparkContext: Invoking stop() from shutdown hook
16/08/09 04:33:04 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static/sql,null}
16/08/09 04:33:04 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/execution/json,null}

How to specify the output format of the tagger so that the output can be used for dependency parsing

First of all, allow me to express my gratitude and appreciation for your work. It seems to be one of the best tools for natural language processing in Vietnamese. I stumbled upon the repository while taking my first steps in this area and found it really helpful. My goal is to build a sentiment analysis tool or framework for Vietnamese.

The tagger and the segmenter seem to be in good working order. However, it is really inconvenient when I want to use the output of the tagger as input for dependency parsing. I opened the source code and took a look, hoping that I could rewrite it so that instead of this format:

{"sentence":"Đức đã ngã gục . Sau 90 phút với 65% kiểm_soát bóng vẫn đã không có bàn thắng , cho_dù đôi lúc đã rất gần với nó ","prediction":"Np R V A . E M N E N V N R R R V N V , C N N R R A E P "}

it would produce this format, which is suitable for dependency parsing:

{"sentence":"Đức/Np đã/R ngã/V gục/A . Sau/E 90/M phút/N với/E 65%/N kiểm_soát/V bóng/N vẫn/R đã/R không/R có/V bàn/N thắng/V , cho_dù/C đôi/N lúc/N đã/R rất/R gần/A với/E nó/P"}

However, much of the code has been simplified, and I found it hard to add my own function there. Is there a way to specify the output format? The source code seems to support three output formats: JSON, PARQUET and TEXT (TEXT seems to be the default).
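As a stopgap, the conversion described above can be done as a post-processing step outside the toolkit. The following is only a minimal sketch, not part of Vitk: the class `TagJoiner` and its `join` method are hypothetical names, and it assumes that the `sentence` and `prediction` fields have already been extracted from the JSON record and that both are whitespace-separated with one tag per token. A token whose tag is identical to it (as with `.` and `,` in the example) is emitted bare, which matches the target format shown above.

```java
import java.util.StringJoiner;

// Hypothetical helper, not part of Vitk: pairs tokens with tags position by position.
final class TagJoiner {

    /** Returns "token/tag" pairs joined by spaces; punctuation tagged as itself stays bare. */
    static String join(String sentence, String prediction) {
        String[] tokens = sentence.trim().split("\\s+");
        String[] tags = prediction.trim().split("\\s+");
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException(
                "Token/tag count mismatch: " + tokens.length + " vs " + tags.length);
        }
        StringJoiner out = new StringJoiner(" ");
        for (int i = 0; i < tokens.length; i++) {
            // In the example output, "." and "," carry themselves as tags; keep those bare.
            out.add(tokens[i].equals(tags[i]) ? tokens[i] : tokens[i] + "/" + tags[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Usage: java TagJoiner "<sentence field>" "<prediction field>"
        if (args.length == 2) {
            System.out.println(join(args[0], args[1]));
        }
    }
}
```

Passing the `sentence` and `prediction` values of the record above as the two arguments should yield the word/tag string; the length check guards against records where the two fields are out of step.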

In addition, if possible, I would like to contribute to the repository under your guidance. Thanks a lot.

Broken links (404 Not Found) at https://github.com/phuonglh/vn.vitk#documentation

Please update the links below in the Documentation section. They currently return 404 Not Found.
http://tech.fpt.com.vn/en/expert-opinion/vietnamese-word-segmentation-part-i-nd498043.html
http://tech.fpt.com.vn/en/expert-opinion/vietnamese-word-segmentation-part-ii-nd498054.html

Slow tokenization speed

The tokenizer is very slow on long text (e.g. wiki pages containing many long paragraphs) and on text containing long runs of the same character (e.g. !!!).
Since there is no split_to_sentences(text) option, I don't know how to reduce the size of the text to be processed.
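One possible workaround is to split the text into sentences before handing it to the tokenizer. The sketch below uses only the JDK's `java.text.BreakIterator` and is not part of Vitk; the class `SentenceSplitter` and its `split` method are hypothetical names, and whether shorter inputs actually speed up the tokenizer is an untested assumption.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical pre-processing helper, not part of Vitk.
final class SentenceSplitter {

    /** Splits text at the sentence boundaries reported by the JDK's BreakIterator. */
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Prints one sentence per line.
        for (String s : split("Hello world. How are you?")) {
            System.out.println(s);
        }
    }
}
```

Each returned sentence can then be processed on its own, for example by writing one sentence per line to the input file. `BreakIterator` relies on generic punctuation rules, so abbreviations and unusual punctuation may be split imperfectly.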

Does the library support outputting results as XML?

Hello Professor,

Compared with the 2011 version, I would like to know whether later versions of your library support writing the results to an XML file.

I read the README but did not see this mentioned.

Thank you.