
pvoosten / explicit-semantic-analysis

Wikipedia-based Explicit Semantic Analysis, as described by Gabrilovich and Markovitch

License: GNU Affero General Public License v3.0

Java 100.00%
semantic-analysis lucene java-8 wikipedia-dump esa concept vector explicit-semantic-analysis java

explicit-semantic-analysis's People

Contributors

dependabot[bot]

explicit-semantic-analysis's Issues

TokenStream contract violation

Hi,
I am trying to use your ESA implementation, and while processing the Wikipedia dump an exception occurred that I don't understand. Can you help me with that?
Thanks

Note: I've tried different Wikipedia dumps (en & nl); the same error happens with both.

Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:110)
at org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl.zzRefill(WikipediaTokenizerImpl.java:574)
at org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl.getNextToken(WikipediaTokenizerImpl.java:781)
at org.apache.lucene.analysis.wikipedia.WikipediaTokenizer.incrementToken(WikipediaTokenizer.java:200)
at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:50)
at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:109)
at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:465)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1537)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1207)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1188)
at be.vanoosten.esa.WikiIndexer.index(WikiIndexer.java:156)
at be.vanoosten.esa.WikiIndexer.endElement(WikiIndexer.java:128)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
at com.sun.org.apache.xerces.internal.xinclude.XIncludeHandler.endElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at be.vanoosten.esa.WikiIndexer.parseXmlDump(WikiIndexer.java:95)
at be.vanoosten.esa.Main.indexing(Main.java:175)
at be.vanoosten.esa.Main.main(Main.java:64)
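
For context, the Lucene Javadocs referenced in the exception spell out the required consuming workflow: reset() exactly once, incrementToken() until it returns false, then end() and close(). A minimal sketch of that workflow (generic Lucene, assuming a 5+ style no-arg StandardAnalyzer; not tied to this repo's WikiIndexer):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenStreamWorkflow {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        try (TokenStream ts = analyzer.tokenStream("text", new StringReader("some wiki text"))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                        // a missing or repeated reset() triggers the contract violation
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }                                      // try-with-resources calls close()
        analyzer.close();
    }
}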

restrict wiki

I have a question about the ESA implementation. I want to apply this method in my Master's thesis, but I want to restrict its knowledge base to only papers about design patterns. Can you help me?
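
The issue doesn't describe a built-in way to do this. One possible approach, sketched here as plain Lucene with a hypothetical DomainFilteringIndexer (this is not this repository's API), is to filter documents before they are added to the index that ESA uses as its knowledge base:

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

// Hypothetical pre-filter: only index documents that look like they belong
// to the restricted domain (here: design patterns).
final class DomainFilteringIndexer {
    private final IndexWriter writer;

    DomainFilteringIndexer(IndexWriter writer) {
        this.writer = writer;
    }

    void indexIfRelevant(String title, String text) throws IOException {
        // Crude keyword gate; replace with whatever criterion defines your corpus.
        if (text.toLowerCase().contains("design pattern")) {
            Document doc = new Document();
            doc.add(new TextField("title", title, Field.Store.YES));
            doc.add(new TextField("text", text, Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}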

Lucene: exception - Query parser encountered <EOF> after “some word”

I ran into a problem when reading a dataset with special characters and requesting the concept vector.
This is easily solved by adding the escape function in the Vectorizer class:

public ConceptVector vectorize(String text) throws ParseException, IOException {
    Query query = queryParser.parse(QueryParser.escape(text));
    TopDocs td = searcher.search(query, conceptCount);
    return new ConceptVector(td, indexReader);
}

Great implementation by the way! Thanks

Source: https://stackoverflow.com/questions/10259907/lucene-exception-query-parser-encountered-eof-after-some-word/10259944
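
For reference, QueryParser.escape simply backslash-escapes Lucene's query-syntax characters so that free text is parsed literally; a small illustration (hypothetical demo class, assuming the classic query parser module):

import org.apache.lucene.queryparser.classic.QueryParser;

public class EscapeDemo {
    public static void main(String[] args) {
        // Unescaped, characters such as ?, :, ( and ) are treated as query syntax,
        // and a dangling operator can produce the "encountered <EOF>" error above.
        String raw = "C++ developers: what is RAII?";
        System.out.println(QueryParser.escape(raw));  // prints the text with \ before each special character
    }
}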

whose UTF8 encoding is longer than the max length 32766

Dear author, I appreciate your contribution to the community.
I have started working with ESA, but when it runs it gives me this error: "Document contains at least one immense term in field="text" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[6c 61 6c 61 6c 64 6b 6a 66 76 6e 74 75 69 76 62 79 6e 65 72 75 72 72 72 72 72 72 72 72 72]...'"
I tried changing the .bz2 dump and also keeping the original (just to see whether the file itself was the problem), but the error persists. I hope you can help me.
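
This is Lucene's standard "immense term" error: a single token's UTF-8 encoding exceeds the 32766-byte limit for an indexed term. One common workaround, not part of this repository and sketched here assuming Lucene 5/6-style analyzer APIs, is to drop oversized tokens with a LengthFilter in the analysis chain:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;

// Sketch of an analyzer that discards tokens longer than 255 characters,
// so no term can approach Lucene's 32766-byte hard limit.
public class CappedWikipediaAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new WikipediaTokenizer();
        TokenStream stream = new LowerCaseFilter(tokenizer);
        stream = new LengthFilter(stream, 1, 255);  // keep only tokens of 1..255 characters
        return new TokenStreamComponents(tokenizer, stream);
    }
}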

accumulated size of entities is "50.000.001" that exceeded the "50.000.000" limit set by "FEATURE_SECURE_PROCESSING"

Hi,
I am trying to run your ESA implementation, but I get an error while indexing this wiki dump:
"enwiki-20170320-pages-articles-multistream.xml.bz2". This is a 13.7 GB dump, by the way.
You can find it at this link: https://dumps.wikimedia.org/enwiki/20170320/

The error says:

The accumulated size of entities is "50.000.001" that exceeded the "50.000.000" limit set by "FEATURE_SECURE_PROCESSING".

I wonder if you ran into the same problem and whether you know how I can raise the limit or disable it altogether. Thank you in advance.

Apr 18, 2017 1:02:12 PM be.vanoosten.esa.WikiIndexer parseXmlDump
SEVERE: null
org.xml.sax.SAXParseException; lineNumber: 64243259; columnNumber: 371; JAXP00010004: The accumulated size of entities is "50.000.001" that exceeded the "50.000.000" limit set by "FEATURE_SECURE_PROCESSING".
	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLScanner.checkEntityLimit(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.handleCharacter(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown Source)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(Unknown Source)
	at javax.xml.parsers.SAXParser.parse(Unknown Source)
	at be.vanoosten.esa.WikiIndexer.parseXmlDump(WikiIndexer.java:95)
	at be.vanoosten.esa.Main.indexing(Main.java:227)
	at be.vanoosten.esa.Main.createEngIndex(Main.java:102)
	at be.vanoosten.esa.Main.main(Main.java:61)
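
One common workaround, outside this repository, is to raise or disable the JAXP entity-size limit, either with the JVM flag -Djdk.xml.totalEntitySizeLimit=0 or programmatically before the SAX parser is created. A minimal sketch (the demo class below is just for illustration):

import java.io.File;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.helpers.DefaultHandler;

public class ParseDumpWithoutEntityLimit {
    public static void main(String[] args) throws Exception {
        // A value of 0 (or negative) means "no limit" for the accumulated entity size.
        System.setProperty("jdk.xml.totalEntitySizeLimit", "0");

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File(args[0]), new DefaultHandler());  // your SAX handler goes here
    }
}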

Similarity of identical strings

Hi,

I set everything up for both the English and French versions of Wikipedia, and everything works fine in the sense that everything compiles and runs. However, my tests give weird results: counter-intuitive similarity values on several texts in both languages, and low similarity when comparing a string with itself. Is that normal? (I guess not.)

Thanks,
Roberto
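
As a sanity check independent of the indexing pipeline: under cosine similarity, a concept vector compared with itself should score 1.0. A small self-contained sketch over sparse concept-to-weight maps (hypothetical helper, not this repository's ConceptVector API):

import java.util.HashMap;
import java.util.Map;

// Cosine similarity of sparse concept->weight vectors; identical inputs must yield 1.0.
public class CosineCheck {
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Double> v = new HashMap<>();
        v.put("Design pattern", 0.8);
        v.put("Software engineering", 0.6);
        System.out.println(cosine(v, v));  // prints 1.0 (up to floating-point rounding)
    }
}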
