
pvoosten / explicit-semantic-analysis

Wikipedia-based Explicit Semantic Analysis, as described by Gabrilovich and Markovitch

License: GNU Affero General Public License v3.0

Java 100.00%
semantic-analysis lucene java-8 wikipedia-dump esa concept vector explicit-semantic-analysis java

explicit-semantic-analysis's People

Contributors

dependabot[bot]

explicit-semantic-analysis's Issues

TokenStream contract violation

Hi,
I am trying to use your ESA implementation, and while processing the Wikipedia dump an exception occurred that I don't understand. Can you help me with that?
Thanks

Note: I've tried different Wikipedia dumps (en & nl); the same error happens with both.

Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:110)
at org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl.zzRefill(WikipediaTokenizerImpl.java:574)
at org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl.getNextToken(WikipediaTokenizerImpl.java:781)
at org.apache.lucene.analysis.wikipedia.WikipediaTokenizer.incrementToken(WikipediaTokenizer.java:200)
at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:50)
at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:109)
at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:465)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1537)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1207)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1188)
at be.vanoosten.esa.WikiIndexer.index(WikiIndexer.java:156)
at be.vanoosten.esa.WikiIndexer.endElement(WikiIndexer.java:128)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
at com.sun.org.apache.xerces.internal.xinclude.XIncludeHandler.endElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at be.vanoosten.esa.WikiIndexer.parseXmlDump(WikiIndexer.java:95)
at be.vanoosten.esa.Main.indexing(Main.java:175)
at be.vanoosten.esa.Main.main(Main.java:64)
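
For context, the Lucene Javadocs referenced in the exception spell out the required consuming workflow: reset() exactly once, incrementToken() until it returns false, then end() and close(). A minimal sketch of that workflow (generic Lucene, assuming a 5+ style no-arg StandardAnalyzer; not tied to this repo's WikiIndexer):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenStreamWorkflow {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        try (TokenStream ts = analyzer.tokenStream("text", new StringReader("some wiki text"))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                        // a missing or repeated reset() triggers the contract violation
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }                                      // try-with-resources calls close()
        analyzer.close();
    }
}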

restrict wiki

I have a question about the ESA implementation. I want to apply this method in my Master's thesis, but I want to restrict its knowledge base to only papers about design patterns. Can you help me?
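
The issue doesn't describe a built-in way to do this. One possible approach, sketched here as plain Lucene with a hypothetical DomainFilteringIndexer (this is not this repository's API), is to filter documents before they are added to the index that ESA uses as its knowledge base:

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

// Hypothetical pre-filter: only index documents that look like they belong
// to the restricted domain (here: design patterns).
final class DomainFilteringIndexer {
    private final IndexWriter writer;

    DomainFilteringIndexer(IndexWriter writer) {
        this.writer = writer;
    }

    void indexIfRelevant(String title, String text) throws IOException {
        // Crude keyword gate; replace with whatever criterion defines your corpus.
        if (text.toLowerCase().contains("design pattern")) {
            Document doc = new Document();
            doc.add(new TextField("title", title, Field.Store.YES));
            doc.add(new TextField("text", text, Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}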

Lucene: exception - Query parser encountered <EOF> after “some word”

I ran into a problem when reading a dataset with special characters and requesting the concept vector.
This is easily solved by adding the escape function in the Vectorizer class:

public ConceptVector vectorize(String text) throws ParseException, IOException {
    Query query = queryParser.parse(QueryParser.escape(text));
    TopDocs td = searcher.search(query, conceptCount);
    return new ConceptVector(td, indexReader);
}

Great implementation by the way! Thanks

Source: https://stackoverflow.com/questions/10259907/lucene-exception-query-parser-encountered-eof-after-some-word/10259944
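
For reference, QueryParser.escape simply backslash-escapes Lucene's query-syntax characters so that free text is parsed literally; a small illustration (hypothetical demo class, assuming the classic query parser module):

import org.apache.lucene.queryparser.classic.QueryParser;

public class EscapeDemo {
    public static void main(String[] args) {
        // Unescaped, characters such as ?, :, ( and ) are treated as query syntax,
        // and a dangling operator can produce the "encountered <EOF>" error above.
        String raw = "C++ developers: what is RAII?";
        System.out.println(QueryParser.escape(raw));  // prints the text with \ before each special character
    }
}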

whose UTF8 encoding is longer than the max length 32766

Dear author, I appreciate your contribution to the community.
I have started working with ESA, but when it runs it gives me this error: "Document contains at least one immense term in field="text" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[6c 61 6c 61 6c 64 6b 6a 66 76 6e 74 75 69 76 62 79 6e 65 72 75 72 72 72 72 72 72 72 72 72]...'"
I tried changing the .bz2 dump and also keeping the original (just to see whether the file itself was the problem), but the error persists. I hope you can help me.
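
This is Lucene's standard "immense term" error: a single token's UTF-8 encoding exceeds the 32766-byte limit for an indexed term. One common workaround, not part of this repository and sketched here assuming Lucene 5/6-style analyzer APIs, is to drop oversized tokens with a LengthFilter in the analysis chain:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;

// Sketch of an analyzer that discards tokens longer than 255 characters,
// so no term can approach Lucene's 32766-byte hard limit.
public class CappedWikipediaAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new WikipediaTokenizer();
        TokenStream stream = new LowerCaseFilter(tokenizer);
        stream = new LengthFilter(stream, 1, 255);  // keep only tokens of 1..255 characters
        return new TokenStreamComponents(tokenizer, stream);
    }
}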

accumulated size of entities is "50.000.001" that exceeded the "50.000.000" limit set by "FEATURE_SECURE_PROCESSING"

Hi,
I am trying to run your ESA implementation, but I get an error while indexing this wiki dump:
"enwiki-20170320-pages-articles-multistream.xml.bz2". This is a 13.7 GB dump, by the way.
You can find it at this link: https://dumps.wikimedia.org/enwiki/20170320/

The error says:

The accumulated size of entities is "50.000.001" that exceeded the "50.000.000" limit set by "FEATURE_SECURE_PROCESSING".

I wonder if you ran into the same problem and whether you know how I can raise the limit or disable it altogether. Thank you in advance.

Apr 18, 2017 1:02:12 PM be.vanoosten.esa.WikiIndexer parseXmlDump
SEVERE: null
org.xml.sax.SAXParseException; lineNumber: 64243259; columnNumber: 371; JAXP00010004: The accumulated size of entities is "50.000.001" that exceeded the "50.000.000" limit set by "FEATURE_SECURE_PROCESSING".
	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLScanner.checkEntityLimit(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.handleCharacter(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(Unknown Source)
	at javax.xml.parsers.SAXParser.parse(Unknown Source)
	at be.vanoosten.esa.WikiIndexer.parseXmlDump(WikiIndexer.java:95)
	at be.vanoosten.esa.Main.indexing(Main.java:227)
	at be.vanoosten.esa.Main.createEngIndex(Main.java:102)
	at be.vanoosten.esa.Main.main(Main.java:61)
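
One common workaround, outside this repository, is to raise or disable the JAXP entity-size limit, either with the JVM flag -Djdk.xml.totalEntitySizeLimit=0 or programmatically before the SAX parser is created. A minimal sketch (the demo class below is just for illustration):

import java.io.File;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.helpers.DefaultHandler;

public class ParseDumpWithoutEntityLimit {
    public static void main(String[] args) throws Exception {
        // A value of 0 (or negative) means "no limit" for the accumulated entity size.
        System.setProperty("jdk.xml.totalEntitySizeLimit", "0");

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File(args[0]), new DefaultHandler());  // your SAX handler goes here
    }
}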

Similarity of identical strings

Hi,

I set everything up for both the English and French versions of Wikipedia, and everything works fine in the sense that everything compiles and runs. However, my tests give weird results: counter-intuitive similarity values on several texts in both languages, and low similarity when comparing a string with itself. Is that normal? (I guess not.)

Thanks,
Roberto
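
As a sanity check independent of the indexing pipeline: under cosine similarity, a concept vector compared with itself should score 1.0. A small self-contained sketch over sparse concept-to-weight maps (hypothetical helper, not this repository's ConceptVector API):

import java.util.HashMap;
import java.util.Map;

// Cosine similarity of sparse concept->weight vectors; identical inputs must yield 1.0.
public class CosineCheck {
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Double> v = new HashMap<>();
        v.put("Design pattern", 0.8);
        v.put("Software engineering", 0.6);
        System.out.println(cosine(v, v));  // prints 1.0 (up to floating-point rounding)
    }
}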
