Comments (4)

mosart commented on July 21, 2024

Hi Erik,

The XML is rendered to HTML by this XSL stylesheet:
https://github.com/Aurora-Network-Global/sdg-queries/blob/master/queries.xsl

I tried to fix that by adding a non-breaking space after the OR statement in the XSL template.
f160f07#diff-34b02004b91728a359a521c831ffef74d95aebdd3b9c03788288b8e9aaa4bcb6

However, when I tried this, the HTML rendering broke completely. (I could not test it on my laptop: for some reason a modern web browser does not render XML to HTML when the file is opened on localhost, only when it is accessed via https.)
So I changed it back.

If you have a -tested- solution for forcing a space after the OR statement in the XSL, please help me out.

Warm regards,
Maurice

from sdg-queries.

erikkemperman commented on July 21, 2024

Hi Maurice,

Thanks for taking a look at the issue -- I'm afraid I am not sure how to remedy it, and I have changed my approach since I reported this. I am now just parsing the raw XML files and compositing the Scopus queries in a Python script. This suits me better anyway, since my goal is to transform the queries to work on Postgres.

Regards,
Erik


mosart commented on July 21, 2024

Nice! That is why we put it in XML: for automation and human readability.
If you want, you can let me know more about your project, and perhaps also share the transformation script (like IDfuse did for Elasticsearch DSL).
You are working at Erasmus University, right?


erikkemperman commented on July 21, 2024

Yes, I am an RSEC at Erasmus!

Agreed that XML is a nice format for this kind of thing -- although to facilitate translation of the queries to other languages, it might be worthwhile to consider making things a bit finer-grained, and perhaps slightly less Scopus-centric (although I understand those are the origins).

Just as an example,

<aqd:query-line field="TITLE-ABS-KEY">
  ("poverty line*") OR ("poverty indicator*")
</aqd:query-line>

To write a script that translates this to other query languages, I first need to parse the XML and then the Scopus query (for which, to my knowledge, no explicit grammar is publicly available, so I've had to cobble something together myself using ANTLR).
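As a rough illustration of the first of those two steps, here is a minimal Python sketch that pulls the field and the raw Scopus text out of a query-line element. The namespace URI and the helper name extract_query_line are made up for this example (the real URI is whatever the Aurora XML files declare), and the second step, parsing the Scopus expression itself, is not attempted here.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, for illustration only; the real one is
# whatever the Aurora XML files declare for the aqd prefix.
AQD_NS = "http://example.org/aqd"

SAMPLE = f"""
<aqd:query-line xmlns:aqd="{AQD_NS}" field="TITLE-ABS-KEY">
  ("poverty line*") OR ("poverty indicator*")
</aqd:query-line>
""".strip()


def extract_query_line(xml_text):
    """Return (field, raw Scopus query text) for one query-line element."""
    element = ET.fromstring(xml_text)
    field = element.get("field")
    # Collapse the indentation and newlines around the query text.
    raw_query = " ".join((element.text or "").split())
    return field, raw_query


field, raw_query = extract_query_line(SAMPLE)
print(field)
print(raw_query)
```

The raw_query string is still an opaque Scopus expression at this point, which is exactly where a separate grammar becomes necessary.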

Suppose, instead, the XML looked something like this:

<aqd:query-line field="TITLE-ABS-KEY">
  <aqd:query-or>
    <aqd:query-parens>
      "poverty line*"
    </aqd:query-parens>
    <aqd:query-parens>
      "poverty indicator*"
    </aqd:query-parens>
  </aqd:query-or>
</aqd:query-line>

That way the tree structure of the query is reflected explicitly in the XML, and it would be much easier to transform to other query languages. Of course, the XSLT to render the Scopus queries would become a bit more complicated. Now that I have an ANTLR grammar that appears to parse the Scopus queries correctly, I suppose it would be pretty easy to use it to convert the current XML to the nested form automatically, so that wouldn't have to be done manually.
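To give an idea of how little code a translator needs once the tree is explicit, here is a small Python sketch that walks the nested form and renders it back to a Scopus string. The element names query-or and query-parens are the ones proposed above, not part of the existing Aurora schema; the namespace URI is again a placeholder, and wrapping the result in FIELD(...) is my assumption about the desired output.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, for illustration only.
AQD_NS = "http://example.org/aqd"

NESTED = f"""
<aqd:query-line xmlns:aqd="{AQD_NS}" field="TITLE-ABS-KEY">
  <aqd:query-or>
    <aqd:query-parens>"poverty line*"</aqd:query-parens>
    <aqd:query-parens>"poverty indicator*"</aqd:query-parens>
  </aqd:query-or>
</aqd:query-line>
""".strip()


def local_name(element):
    """Strip the '{namespace}' part that ElementTree puts in front of tags."""
    return element.tag.rsplit("}", 1)[-1]


def render_scopus(element):
    """Recursively render the proposed tree form as a Scopus query string."""
    name = local_name(element)
    children = [render_scopus(child) for child in element]
    if name == "query-or":
        return " OR ".join(children)
    if name == "query-parens":
        # Either nested operators or a leaf term, wrapped in parentheses.
        text = " ".join((element.text or "").split())
        inner = " ".join(children) if children else text
        return f"({inner})"
    if name == "query-line":
        return f"{element.get('field')}({' '.join(children)})"
    raise ValueError(f"unexpected element: {name}")


scopus_query = render_scopus(ET.fromstring(NESTED))
print(scopus_query)
```

A renderer for another target language would have the same shape, with only the string templates in each branch changed.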

As an aside, I'm beginning to regret the choice (not mine) of Postgres. The argument at the time was that it supports something like Scopus' W/N proximity operator. But after playing around with this and reading up on Postgres' <N> operator, I find it is actually subtly different.

For one thing, the Scopus proximity operator is not directional, i.e. A W/3 B matches the same documents as B W/3 A. This is not true in Postgres, so to get an equivalent query I have to emit extra clauses, e.g. (A <3> B) || (B <3> A). (*)

Another complication is that Scopus' W/3 means "within 3 or fewer words/lexemes", whereas the Postgres operator matches an exact distance. So actually, the equivalent of A W/3 B would be something like (A <1> B) || (B <1> A) || (A <2> B) || (B <2> A) || (A <3> B) || (B <3> A).
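A minimal Python sketch of that expansion for a single operator, following the mapping described above (both directions, every distance from 1 to N). The function name expand_within is made up for this example. It emits the clauses joined with |, which is the OR inside a tsquery string, while || as written above combines tsquery values in SQL. Whether Scopus' N and Postgres' N count positions in exactly the same way is not checked here.

```python
def expand_within(a, b, n):
    """Expand a Scopus-style 'a W/n b' into tsquery text.

    Emits both directions for every exact distance 1..n, joined with
    '|' (OR inside a tsquery string).
    """
    clauses = []
    for distance in range(1, n + 1):
        clauses.append(f"({a} <{distance}> {b})")
        clauses.append(f"({b} <{distance}> {a})")
    return " | ".join(clauses)


expanded = expand_within("A", "B", 3)
print(expanded)
# (A <1> B) | (B <1> A) | (A <2> B) | (B <2> A) | (A <3> B) | (B <3> A)
```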

Of course, these problems compound very quickly if multiple proximity operators occur in a single query: if I am given A W/3 B W/3 C, I will have to emit clauses for each permutation of A, B, and C (6 of them) combined with the Cartesian product of the two ranges 1, 2, 3 (9 combinations), for a total of 6 × 9 = 54 (!) clauses. And this is a trivial example; you can imagine I am ending up with some gigantic queries for the real thing!
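The counting can be reproduced with a few lines of Python. This sketch only enumerates the clauses the way the paragraph above describes (every ordering of the terms times every combination of exact distances); it assumes the same N for every operator in the chain and makes no claim that the result is semantically identical to what Scopus does. The name expand_chain is made up for this example.

```python
from itertools import permutations, product


def expand_chain(terms, n):
    """Enumerate tsquery clauses for 'T1 W/n T2 W/n ... Tk'.

    One clause per ordering of the terms and per combination of exact
    distances (1..n) for each of the k-1 gaps.
    """
    gaps = len(terms) - 1
    clauses = []
    for ordering in permutations(terms):
        for distances in product(range(1, n + 1), repeat=gaps):
            parts = [ordering[0]]
            for distance, term in zip(distances, ordering[1:]):
                parts.append(f"<{distance}> {term}")
            clauses.append("(" + " ".join(parts) + ")")
    return clauses


clauses = expand_chain(["A", "B", "C"], 3)
print(len(clauses))  # 6 orderings * 9 distance pairs = 54
print(clauses[0])    # (A <1> B <1> C)
```

The clause count is k! × n^(k-1) for k terms, so a fourth term at W/3 would already give 24 × 27 = 648 clauses.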

Finally, I end up not using the more advanced features of Postgres text search, and in fact I have to force it to "simple" mode in order to make the Scopus wildcards work. Postgres would like to help me with this, stemming words in the documents and queries for me, ignoring stop words, and leveraging a built-in thesaurus for synonyms.

But the way the Scopus queries are given here defeats this. For example, eradicat* occurs in the Scopus queries, but since that isn't a known word, Postgres doesn't know how to stem it. So unless I force simple mode, a document containing the word eradicate or eradication will not match this query, because Postgres will have stemmed the valid word in the document but not the term in the query...

I can imagine, although it will be a lot of work, enriching the Aurora XML with a few valid expansions of the wild-carded terms. That way I can use those in my Postgres queries and have it do its magic.

Anyway, I have to get on with the next phase and unfortunately can't linger on these issues. I just thought I'd mention these observations while they are fresh in my mind. If I have a bit more time and you are interested, I might revisit this and try to come up with some more constructive / concrete proposals.

(*) Incidentally, Scopus also has a directed variant, PRE/N, and I wonder if some of the Aurora queries would be more precisely expressed that way.
