pvoosten / explicit-semantic-analysis
Wikipedia-based Explicit Semantic Analysis, as described by Gabrilovich and Markovitch
License: GNU Affero General Public License v3.0
Hi,
I am trying to use your ESA implementation, and while processing the Wikipedia dump an exception occurred that I don't understand. Can you help me with that?
Thanks
Note: I've tried different Wikipedia dumps (en & nl); the same error happened with both.
Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:110)
at org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl.zzRefill(WikipediaTokenizerImpl.java:574)
at org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl.getNextToken(WikipediaTokenizerImpl.java:781)
at org.apache.lucene.analysis.wikipedia.WikipediaTokenizer.incrementToken(WikipediaTokenizer.java:200)
at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:50)
at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:109)
at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:465)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1537)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1207)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1188)
at be.vanoosten.esa.WikiIndexer.index(WikiIndexer.java:156)
at be.vanoosten.esa.WikiIndexer.endElement(WikiIndexer.java:128)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
at com.sun.org.apache.xerces.internal.xinclude.XIncludeHandler.endElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at be.vanoosten.esa.WikiIndexer.parseXmlDump(WikiIndexer.java:95)
at be.vanoosten.esa.Main.indexing(Main.java:175)
at be.vanoosten.esa.Main.main(Main.java:64)
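For reference, Lucene's TokenStream contract (which this exception complains about) requires calling reset() exactly once before the first incrementToken(), then end() and close() once the stream is exhausted; reusing a stream for the next document without closing it first triggers exactly this IllegalStateException. Below is a minimal sketch of a compliant consumer, not code from this repo; it assumes Lucene 4.x as suggested by the stack trace, and the Version constant is a guess you should match to your classpath:

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenStreamContract {
    public static void main(String[] args) throws IOException {
        // Version.LUCENE_47 is an assumption; use the constant for your Lucene version.
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
        try (TokenStream ts = analyzer.tokenStream("text", new StringReader("Hello ESA world"))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                 // mandatory, exactly once, before consuming
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();                   // record end-of-stream attribute state
        }                               // try-with-resources calls close(), required before reuse
    }
}
```

If WikiIndexer.index feeds a fresh Reader to the same tokenizer for each article, the usual cause of this error is that the previous stream was not closed (or reset() was skipped) before the next document was indexed.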
I have a question about the ESA implementation. I want to apply this method in my Master's thesis, but I want to restrict its knowledge base to papers about design patterns only. Can you help me? :((
I ran into a problem when reading a dataset with special characters and trying to get the concept vector.
This is easily solved by escaping the query text with QueryParser.escape in the Vectorizer class:
public ConceptVector vectorize(String text) throws ParseException, IOException {
    Query query = queryParser.parse(QueryParser.escape(text));
    TopDocs td = searcher.search(query, conceptCount);
    return new ConceptVector(td, indexReader);
}
Great implementation by the way! Thanks
Dear author, I appreciate your contribution to the community.
I have started working with ESA, but when it runs it gives me this error: "Document contains at least one immense term in field="text" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[6c 61 6c 61 6c 64 6b 6a 66 76 6e 74 75 69 76 62 79 6e 65 72 75 72 72 72 72 72 72 72 72 72] ... '"
I tried a different .bz2 file (just to see whether the dump itself was the problem), but the error persists. I hope you can help me.
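A common workaround for this error (a sketch, not code from this repo) is to add Lucene's LengthFilter to the analyzer chain, so no token can ever approach the hard 32766-byte-per-term limit that IndexWriter enforces. The class name, the field name, and the 255-character cap below are illustrative; note that in Lucene 4.x several of these constructors take an extra Version argument:

```java
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;

// Hypothetical analyzer: same token chain as the stack traces above,
// plus a LengthFilter that silently drops over-long tokens.
public class CappedWikiAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WikipediaTokenizer();
        TokenStream sink = new LowerCaseFilter(source);
        // Keep tokens of 1..255 chars; anything longer is dropped, far
        // below the 32766-byte UTF-8 limit per indexed term.
        sink = new LengthFilter(sink, 1, 255);
        return new TokenStreamComponents(source, sink);
    }

    public static void main(String[] args) throws Exception {
        // A 40000-char run of 'a' simulates the "immense term" from the error.
        String immense = new String(new char[40000]).replace('\0', 'a');
        try (Analyzer a = new CappedWikiAnalyzer();
             TokenStream ts = a.tokenStream("text", new StringReader("short " + immense))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }
    }
}
```

With the filter in place, only "short" survives; the immense token is discarded instead of aborting indexing.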
Hi,
I am trying to run your ESA implementation, but I get an error while indexing this wiki dump:
"enwiki-20170320-pages-articles-multistream.xml.bz2" (a 13.7 GB dump, by the way).
You can find it at this link: https://dumps.wikimedia.org/enwiki/20170320/
The error says:
The accumulated size of entities is "50.000.001" that exceeded the "50.000.000" limit set by "FEATURE_SECURE_PROCESSING".
I wonder whether you ran into the same problem, and whether you know how I can raise the limit or disable it altogether. Thank you in advance.
Apr 18, 2017 1:02:12 PM be.vanoosten.esa.WikiIndexer parseXmlDump
SEVERE: null
org.xml.sax.SAXParseException; lineNumber: 64243259; columnNumber: 371; JAXP00010004: The accumulated size of entities is "50.000.001" that exceeded the "50.000.000" limit set by "FEATURE_SECURE_PROCESSING".
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLScanner.checkEntityLimit(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.handleCharacter(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at be.vanoosten.esa.WikiIndexer.parseXmlDump(WikiIndexer.java:95)
at be.vanoosten.esa.Main.indexing(Main.java:227)
at be.vanoosten.esa.Main.createEngIndex(Main.java:102)
at be.vanoosten.esa.Main.main(Main.java:61)
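The JAXP00010004 error comes from the JDK's secure-processing entity limits, not from the ESA code itself. Two JDK-only ways to lift the limit are sketched below; the jdk.xml.totalEntitySizeLimit property is the documented JAXP limit property (JDK 7u45/8+), and whether it may be applied to the factory used in WikiIndexer.parseXmlDump is an assumption about that code:

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;

public class LiftEntityLimit {
    public static void main(String[] args) throws Exception {
        // Option 1: raise the JVM-wide limit before any parser is created;
        // 0 means "no limit". Equivalent to -Djdk.xml.totalEntitySizeLimit=0.
        System.setProperty("jdk.xml.totalEntitySizeLimit", "0");

        // Option 2: disable FEATURE_SECURE_PROCESSING on the SAXParserFactory.
        // Only do this for trusted input, such as an official Wikipedia dump.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, false);

        System.out.println("secure processing: "
                + factory.getFeature(XMLConstants.FEATURE_SECURE_PROCESSING));
    }
}
```

The system property must be set before the first parser is instantiated; passing it on the command line with -D is the safest way to guarantee that.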
Hi,
I set everything up for both the English and French versions of Wikipedia, and everything works fine in the sense that it compiles and runs. However, my tests produce weird results: counter-intuitive similarity values on several texts in both languages, and low similarity when comparing a string with itself. Is that normal? (I guess not.)
Thanks,
Roberto