marklogic-community / corona


Community REST API for MarkLogic

License: Other

Perl 0.16% JavaScript 38.79% XQuery 55.79% Python 5.25%

corona's People

grtjn ryangrimm scottconroy darobin dahabit sreenathm mallorymegan1984 melkis7

corona's Issues

Destroy should return uris of deleted documents

In the previous version we were returning URIs only; now only metadata is returned:

  Nuno-Jobs-MacBook-Pro:nuvem njob$ NODE_ENV=development node test/json/destroy.js 
  { method: 'PUT',
    headers: { 'content-type': 'application/json' },
    uri: 'http://admin:admin@localhost:4380/json/store/i_bulk_custom_query',
    body: '{"dino":false}' }
  { method: 'PUT',
    headers: { 'content-type': 'application/json' },
    uri: 'http://admin:admin@localhost:4380/json/store/are_bulk_custom_query',
    body: '{"dino":false}' }
  { method: 'PUT',
    headers: { 'content-type': 'application/json' },
    uri: 'http://admin:admin@localhost:4380/json/store/dino_bulk_custom_query',
    body: '{"dino":"RWAR"}' }
  { method: 'DELETE',
    headers: { 'content-type': 'application/json' },
    uri: 'http://admin:admin@localhost:4380/json/store?customquery=%7B%22equals%22%3A%7B%22key%22%3A%22dino%22%2C%22value%22%3A%22RWAR%22%7D%7D&bulkDelete=true' }
  ✗ 

    bulk_custom_query
      ✗ ok
        » expected {
      deleted: 1,
      numRemaining: 0,
      uris: [ '/dino_bulk_custom_query' ]
  },
    got  {
      meta: { deleted: 1, numRemaining: 0 }
  } (==) // destroy.js:46
  ✗ Broken » 1 broken (0.309s)

RFE: Get search results for XML documents in JSON format

I recall this being asked for in internal review. Buxton just asked for it as well.

If people want to include the documents in the results, we have JSON extensions to handle that.

More robust error handling on insert

If you do a PUT to store with no body in the request, the request eventually times out because the code has errored out on the server. For example, sending a request to /store?uri=foo.xml with no body produces the following stack trace on the server. On the client the request just times out.

I have witnessed this type of poor error handling in other places as well, such as posting invalid fields to a search request. I think that the error handling in general should be reviewed.

Request Stack Trace
Current Expression:
/corona/lib/json.xqy: 156 map:map()
Global Variables
$analyzeString = xdmp:function(xs:QName("fn:analyze-string"))
$isSupported = fn:true()
Stack Trace:
/corona/lib/json.xqy: 156 json:object(("status", 500, "code", ...))
Local Variables
$keyValues = ("status", 500, "code", ...)
/corona/lib/common.xqy: 75 common:error("corona:INTERNAL-ERROR", "Invalid coercion (XDMP-AS: (err:XPTY0004) $content as xs:string ...", "json")
Local Variables
$exceptionCode = "corona:INTERNAL-ERROR"
$message = "Invalid coercion (XDMP-AS: (err:XPTY0004) $content as xs:string ..."
$outputFormat = "json"
$isA400 = ("corona:DUPLICATE-INDEX-NAME", "corona:DUPLICATE-PLACE-ITEM", "corona:REQUIRES-BULK-DELETE")
$isA500 = ("corona:UNSUPPORTED-METHOD", "corona:INTERNAL-ERROR")
$statusCode = 500
$set = ()
$add = ()
/corona/lib/common.xqy: 95 common:errorFromException(<error:error xsi:schemaLocation="http://marklogic.com/xdmp/error error.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:error="http://marklogic.com/xdmp/error">error:codeXDMP-AS/error:codeerror:nameerr:XPTY0004/error:...</error:error, "json")
Local Variables
$exception = <error:error xsi:schemaLocation="http://marklogic.com/xdmp/error error.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:error="http://marklogic.com/xdmp/error">error:codeXDMP-AS/error:codeerror:nameerr:XPTY0004/error:...</error:error
$outputFormat = "json"
/corona/store.xqy: 162 (no expression source available)
Local Variables
$requestMethod = "PUT"
$params = map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:map="http://marklogic.com/xdmp/map"><map:entry key="uri"><map:value xsi:type="xs:string">/books/a_an.../map:map)
$uri = "/books/a_and_c.xml"
$txid = ()
$outputFormat = "json"
$errors = ()
$e = <error:error xsi:schemaLocation="http://marklogic.com/xdmp/error error.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:error="http://marklogic.com/xdmp/error">error:codeXDMP-AS/error:codeerror:nameerr:XPTY0004/error:...</error:error

Examine search result format vs Search API format

The two formats should be as similar as possible while keeping complexity down.

Error Handler

We need a MarkLogic error handler to send errors from MarkLogic in either XML or JSON
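A minimal sketch of what such a handler could produce, written in JavaScript purely for illustration (the real handler would live on the MarkLogic side). The function names `formatError` and `escapeXml` are hypothetical. The JSON shape follows the `{"error":{...}}` response quoted in the setup issue, and the XML element names follow the `corona:error` response quoted in the coercion issue.

```javascript
// Hypothetical helper: escape text for safe inclusion in XML element content.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: serialize one error in the format the client asked for.
// Anything other than "xml" falls back to JSON (an assumption of this sketch).
function formatError(status, code, message, outputFormat) {
  if (outputFormat === "xml") {
    return (
      '<corona:error xmlns:corona="http://marklogic.com/corona">' +
      "<corona:status>" + status + "</corona:status>" +
      "<corona:code>" + escapeXml(code) + "</corona:code>" +
      "<corona:message>" + escapeXml(message) + "</corona:message>" +
      "</corona:error>"
    );
  }
  return JSON.stringify({
    error: { status: status, code: code, message: message },
  });
}
```

Keeping both serializations behind one function means every endpoint can report failures the same way, whichever output format the caller requested.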

setup throws an error

When I hit the setup.xqy at http://localhost:8010/config/setup.xqy I get the following error. The framework otherwise seems to be functioning fine, indicating that I have the correct configuration for the url rewriter.
{"error":{"status":404,"code":"corona:ENDPOINT-NOT-FOUND","message":"Invalid endpoint. Check path and parameters for errors."}}

coercion error when collection parameter is used on PUT

dfeldman@hp001 /cygdrive/c/Program Files/MarkLogic/Data/Logs

$ curl -X PUT 'http://admin:admin@localhost:5010/store?uri=/foo/bar2.xml?collection=c' --data ''

<corona:error xmlns:corona="http://marklogic.com/corona"><corona:status>500</corona:status><corona:code>corona:INTERNAL-ERROR</corona:code><corona:message>Invalid coercion (XDMP-AS: (err:XPTY0004) $contentType as xs:string -- Invalid coercion: () as xs:string)</corona:message></corona:error>

works fine without the ?collection=c param.

RFE: Facet on collections and uris

It'd be good for people to be able to facet on collections and uris, and use them like they'd use any other range index value. This lets people get the collections or URIs that match a query without touching the disk. Perhaps we should have people enable them like they do ranges?

Corona setup page confusing

The Corona setup page, accessed at /config/setup, has two links. The bottom link takes you to a page that does the Corona user setup. The top link goes to /config/setup.xqy and doesn't seem to do anything. Why is it a link?

RFE: Analytical functions against facets

We could support a few analytical functions against ranges:

Median and other percentiles
Mean average
Standard deviation
Sum

These will run much faster if you don't have to transmit the full batch of data from the server to client.
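As a rough illustration of the arithmetic involved, here is a JavaScript sketch that computes these statistics from value/frequency pairs, which is the form a range index facet naturally takes. The input shape `{ value, count }` and the function name `summarize` are assumptions for this example. The sketch also assumes the nearest-rank definition of a percentile and the population (not sample) standard deviation.

```javascript
// Summarize a facet given as [{ value: number, count: number }, ...].
// Returns null for an empty facet.
function summarize(facet) {
  const sorted = facet.slice().sort((a, b) => a.value - b.value);
  const n = sorted.reduce((acc, e) => acc + e.count, 0);
  if (n === 0) return null;

  const sum = sorted.reduce((acc, e) => acc + e.value * e.count, 0);
  const mean = sum / n;
  // Population variance: frequency-weighted squared deviations divided by n.
  const variance =
    sorted.reduce((acc, e) => acc + e.count * (e.value - mean) ** 2, 0) / n;

  // Nearest-rank percentile: the smallest value whose cumulative
  // count reaches p percent of the total.
  function percentile(p) {
    const rank = Math.max(1, Math.ceil((p / 100) * n));
    let seen = 0;
    for (const e of sorted) {
      seen += e.count;
      if (seen >= rank) return e.value;
    }
    return sorted[sorted.length - 1].value;
  }

  return {
    sum: sum,
    mean: mean,
    stddev: Math.sqrt(variance),
    median: percentile(50),
    percentile: percentile,
  };
}
```

Because the computation only needs each distinct value and its frequency, the server never has to ship individual documents to the client.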

RFE: Deeper geo query support

MarkLogic 5 introduces complex polygon constraints, for things like donuts. That's something we could fairly easily allow a structured query to declare.

There are other geo primitives, like "distance", that maybe should be exposed somehow. If you want to sort by distance, for example, that could be handy. I'm not sure exactly how to expose that, but it seems valuable.

RFE: Server-Side Update

As in

function updateUser(ghUser) {
  console.log("Trying to update github user " + ghUser.login);
  db.json.update_first({ github_login: ghUser.login })
    .increment('version', 1)
    .replace('github_login', ghUser.login)
    .replace('github', ghUser)
    .delete('linkedin.deletemeplease')
    .rename('meeutp', 'meetup')
    .push('anarray', 1)
    .addCollection('github') // also replaceCollection
    .setQuality(9)
    .replacePermissions(['a', 'b', 'c']) // also addPermissions
    .save(function saveCb(e) {
      if (e) {
        console.log("Couldn't update " + ghUser.login);
        return;
      }
      console.log(ghUser.login + " updated");
    });
}

Strange response to query

Request

  http://localhost:4380/json/query?q=fox

Response

{"meta":{"start":1,"end":1,"total":2},"results":[{"uri":"/foo/another_snow","content":{"foo":"to find something you could eat fox"}}]}

Question: If I didn't specify start and end, shouldn't this return both results that match?

RFE: Ability to get count without results

A pretty common thing is to just get a count of docs matching a query. We should make sure that's possible with the various query interfaces, and possible in a way that regular folks will figure out.

RFE: Reverse queries

At some point Corona should support reverse queries.

I picture letting the user save queries with names. Then execute a reverse query on each document insert or modification (easy to do when in a managed context, could still use triggers and CPF when not). Let the user indicate what should happen on a new match. Expose some decent primitives for them to choose from: invoke a url, email out, make note in a way that can be polled later, run some uploaded code, etc.

RFE: Proximity boosting

If someone searches for several words but doesn't use quotes, we should be able to boost results that have those words together as a phrase or in close proximity. Maybe as a configurable setting.

It can be implemented with an or-query combination of phrase queries and near queries as well as separate word queries, and ML has some built-in boosting support as well.
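The or-query combination described above might be sketched as follows. The object keys (`or`, `and`, `phrase`, `near`, `word`, `distance`, `weight`) are illustrative placeholders, not Corona's actual structured query syntax, and the weights and default distance are arbitrary assumptions. Requiring all words in the fallback branch is also an assumption.

```javascript
// Build a query tree that matches the individual words but scores
// documents higher when the words appear as a phrase or near each other.
function proximityBoostedQuery(text, nearDistance = 10) {
  const words = text.trim().split(/\s+/).filter(Boolean);
  // Fallback branch: every word must appear somewhere in the document.
  const allWords = { and: words.map((w) => ({ word: w, weight: 1 })) };
  // With fewer than two words there is nothing to boost.
  if (words.length < 2) return allWords;

  return {
    or: [
      // Exact phrase gets the largest boost.
      { phrase: words.join(" "), weight: 4 },
      // Words within nearDistance of each other get a smaller boost.
      { near: words.map((w) => ({ word: w })), distance: nearDistance, weight: 2 },
      allWords,
    ],
  };
}
```

A document matching the phrase also matches the other two branches, so its score accumulates contributions from all three, which is what produces the boost.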

RFE: Give custom query the true-query and false-query equivalents

Using cts:query people sometimes write cts:and-query(()) to indicate the equivalent of the non-existent cts:true-query() and cts:or-query(()) for the non-existent cts:false-query(). I think Corona should provide the true/false query primitives so this is more approachable to people.

Geo searches need geo indexes

The structured query construct supports declaring geographic queries, but there's no way via Corona to declare the geo indexes that need to be there to support them.

Weight on geo is currently ignored, should prob not expose

http://developer.marklogic.com/pubs/5.0/apidocs/cts-query.html#cts:element-geospatial-query says

"$weight (optional): A weight for this query. The default is 1.0. This option is currently ignored; geospatial queries do not contribute to the score."

So I'd remove that feature from Corona until such time as we can implement it. It's OK with me if you keep the code the same, but remove it from the docs.

RFE: The /search endpoint should support POST

Some structured queries will be very large. Should be able to POST them.

In the docs we should make clear you post the params, one of which is JSON or XML, and you don't post JSON or XML as its own request body.
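To make that distinction concrete, here is a small sketch of a client building such a request. The parameter name `structuredQuery` and the function name `buildSearchPost` are hypothetical. The point is only that the JSON travels as the value of a form parameter, not as the raw request body.

```javascript
// Build the pieces of a form-encoded POST in which the structured
// query is one parameter among others.
function buildSearchPost(structuredQuery, extraParams = {}) {
  const params = new URLSearchParams(extraParams);
  // Hypothetical parameter name; the JSON is serialized and then URL-encoded.
  params.set("structuredQuery", JSON.stringify(structuredQuery));
  return {
    method: "POST",
    headers: { "content-type": "application/x-www-form-urlencoded" },
    body: params.toString(),
  };
}
```

The returned object has the same `method`/`headers`/`body` shape as the request dumps shown in the destroy test output, so it can be handed to whatever HTTP client is in use.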

Should our JSON Path borrow ideas from JSONiq?

An investigation topic.

Doc write doesn't set perms so current user can see it

When writing a document Corona should set it so that the current user can read/update/insert/execute the file. Right now Corona can't see the files it's written unless running as admin.

Yeah, probably even execute so you can use Corona to manage your module code.

I'm thinking current user with all their roles is better than just the role corona-dev.

Docs don't explain how to list managed items

The docs for namespaces, places, ranges, buckets, and transformers don't explain how to get a list of the current ones.

websockets ...

I wonder how websockets would get supported with Corona ... any ideas ?

RFE: Support binary documents

It would be good to support storing binary documents.

It would also be good to optionally extract metadata and text content from them and make it available for query.

RFE: Support parsing more date locales

Today the date parser understands many different styles, in 5 languages (and tested against millions of MarkMail messages). We should probably support all the languages where MarkLogic has enhanced support.

It should be fairly easy to generate the match strings. Write a Java program that uses SimpleDateFormat to print dates in all the different styles, then adjust the locale to get examples for all the necessary languages.

RFE: Let Corona Admins configure index settings

Today control over index features like case sensitivity is only available to database admins. That should probably be available to Corona admins via web service endpoints.

I think we should probably let Corona admins set the same options as MarkLogic fields let you set:

stemmed searches (combo)
word searches
fast phrase searches
fast case sensitive searches
fast diacritic sensitive searches
trailing wildcard searches
trailing wildcard word positions
three character searches
three character word positions
two character searches
one character searches

Then on Place definitions, you can optionally set these same ones as overrides. Handy.

RFE: Add internal Corona security roles

To secure access to Corona and make sure casual developers don't mess with the internals of the Corona managed context, I envision having three roles: corona-dev, corona-admin, and corona-internal.

Web endpoint access will require corona-dev (for regular endpoints) or corona-admin (for the management endpoints). The corona-admin role will inherit corona-dev.

Document access will require corona-internal, which is a role no actual users should have but which the internal Corona code amps itself to have during document storage and retrieval calls. This keeps regular users, even corona-dev users, from directly accessing the files managed by Corona without going through Corona's "business logic".

Is that good? Well, running a managed context has downsides, true, but bigger upsides I think. It lets us do reliable metadata tracking (is a saved file considered XML or JSON?), auditing, quota enforcement, implicit consistent hashing distribution, and so on. The list is pretty long. It also means we can keep regular users from seeing JSON files in their raw XML serialization and wrongly issuing XPaths against a format that might change.

Since users should have XQuery-level access to the docs managed by Corona, they'll need XQuery-level APIs. There'll be a corona:doc("foo.json") for example. This knows to fetch the file back as JSON. Internally it's calls like this that will amp to the corona-internal role to allow it to see the raw XML.
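The endpoint side of this role scheme can be sketched as a small lookup. This is illustrative JavaScript, not how MarkLogic evaluates roles. The names `INHERITS`, `REQUIRED_ROLE`, `expandRoles`, and `canCallEndpoint` are hypothetical, and the two endpoint kinds are labels chosen for the example.

```javascript
// Role inheritance as described: corona-admin inherits corona-dev.
const INHERITS = { "corona-admin": ["corona-dev"] };

// Role required for each kind of web endpoint.
const REQUIRED_ROLE = { regular: "corona-dev", management: "corona-admin" };

// Expand a user's roles to include everything they inherit.
function expandRoles(roles) {
  const all = new Set();
  const pending = roles.slice();
  while (pending.length > 0) {
    const role = pending.pop();
    if (all.has(role)) continue;
    all.add(role);
    pending.push(...(INHERITS[role] || []));
  }
  return all;
}

// True when the user's effective roles include the one the endpoint needs.
function canCallEndpoint(userRoles, endpointKind) {
  const required = REQUIRED_ROLE[endpointKind];
  return required !== undefined && expandRoles(userRoles).has(required);
}
```

Note that corona-internal deliberately appears nowhere in the endpoint table: no user holds it, and only Corona's own code would amp to it when reading or writing documents.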

RFE: Update RecordLoader to speak Corona's protocol

This would be convenient for people doing bulk uploads.

It will also let us do a fair comparison of performance between the Corona vs XDBC protocols.

For performance we should probably group documents into small batches (~32 docs), using multi-statement transactions and/or multiple documents uploaded together.
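The batching step itself is simple. A sketch, assuming the documents are already in an in-memory array and using a hypothetical function name `toBatches`:

```javascript
// Split an array of documents into consecutive batches of at most
// batchSize items. The default of 32 follows the figure suggested above.
function toBatches(docs, batchSize = 32) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    batches.push(docs.slice(i, i + batchSize));
  }
  return batches;
}
```

Each batch would then be sent either as one multi-document upload or inside one multi-statement transaction, whichever the protocol ends up supporting.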

RFE: State and database clear for unit testing

Request from Clark. He'd like a way to clear the full Corona state to better support unit testing.

Perhaps you should be able to clear each managed item (places, namespaces, etc) as well as clear them all.

We should probably support clearing the database too. (What about non-Corona data in the database?)

RFE: Implement consistent hashing distribution

Background: Normally MarkLogic does its own assignment of documents to forests and sends messages to all D-nodes when doing a document retrieval. Forest placement is the act of telling MarkLogic explicitly in which forest to place a document. In-forest eval is the act of limiting a query to a particular forest (or forests), which is commonly done when that forest is known to contain the document being retrieved due to previous forest placement. On large clusters these techniques can help with scaling. Documents are often assigned to forests using a consistent hashing algorithm on URIs.

The challenge of consistent hashing is handling the case when the topology of the forests changes (i.e. when a new forest is added). But by adding a level of indirection (essentially hashing documents into buckets and tracking which buckets are assigned to which forests) you can handle a new topology by moving bucket assignments to new D-nodes and maintaining a memory of which buckets are where. Hash -> bucket -> forest.

You can implement this in pure XQuery. It won't be invisible to the user though because the XQuery programmer will need to use custom store and retrieve calls that are hash-aware. Moreover, when loading from XCC the client needs to know about buckets, which is inelegant.

With Corona we can do it all effectively and invisibly. All doc stores would go through the hash -> bucket -> forest assignment. Doc retrievals also. Moreover, Corona could also do background rebalancing as new forests are added by moving buckets of documents to the new forests. That could be done automatically or via a web call.

Being a fully managed context has its advantages.
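The hash -> bucket -> forest indirection can be sketched in a few lines. This is an illustration only: the bucket count, the FNV-1a hash, the forest names, and all function names are assumptions for the example, not anything Corona defines.

```javascript
// Fixed number of buckets; it never changes, so a URI's bucket is stable.
const BUCKET_COUNT = 64;

// FNV-1a 32-bit hash over the URI's UTF-16 code units.
function hashUri(uri) {
  let h = 0x811c9dc5;
  for (let i = 0; i < uri.length; i++) {
    h ^= uri.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

function bucketFor(uri) {
  return hashUri(uri) % BUCKET_COUNT;
}

// Initial bucket -> forest table: spread buckets round-robin.
function initialAssignment(forests) {
  const table = [];
  for (let b = 0; b < BUCKET_COUNT; b++) table.push(forests[b % forests.length]);
  return table;
}

// Adding a forest reassigns only a fair share of buckets to it.
// Buckets that are not moved keep their forest.
function addForest(table, newForest) {
  const next = table.slice();
  const counts = new Map();
  for (const f of next) counts.set(f, (counts.get(f) || 0) + 1);
  const target = Math.floor(next.length / (counts.size + 1));
  let moved = 0;
  for (let b = 0; b < next.length && moved < target; b++) {
    const owner = next[b];
    if (counts.get(owner) > target) {
      counts.set(owner, counts.get(owner) - 1);
      next[b] = newForest;
      moved++;
    }
  }
  return next;
}

function forestFor(table, uri) {
  return table[bucketFor(uri)];
}
```

Only the documents in the reassigned buckets need to move during rebalancing, and a retrieval can always find its forest by consulting the current table.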

RFE: Bulk retrieval of URIs respecting passed-in URI order

An issue raised by Fernando. He'd like to be able to request a set of documents at once by URI, and get the results back in the same order as the URIs were provided.
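A sketch of the reordering step, assuming results carry a `uri` field as in the query responses shown elsewhere in these issues. The function name and the `missing` marker for URIs with no matching document are hypothetical choices for this example.

```javascript
// Return results in the same order as the requested URIs, regardless
// of the order in which the database produced them.
function orderByRequestedUris(requestedUris, results) {
  const byUri = new Map(results.map((r) => [r.uri, r]));
  return requestedUris.map(
    // Keep a placeholder so positions still line up with the request.
    (uri) => byUri.get(uri) || { uri: uri, missing: true }
  );
}
```

Keeping a placeholder for missing documents lets the caller match response positions to request positions without a second lookup.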

minor rendering issue in blog nav

Open up 2011, January - you will notice two horizontal lines below January - probably should be only 1.

RFE: Enable safe browser/Corona communication

People are going to pretty quickly want to talk to Corona directly from browsers. But that's not really safe. A malicious browser gets full access to the Corona data. We should think about ways to make that architecture safe.

Examples are the parse.com REST APIs and security model.

Query should return confidence

Any reason why we don't return confidence and fitness for each result? Seems like it could help people improve their queries.

RFE: Filtered vs unfiltered search option

We should give developers the ability to do filtered or unfiltered searches. The default should probably be unfiltered, but when you need filtering you really need it. We just need to be able to explain it to non-experts. Here's a stab at that:

"The default search behavior is 'unfiltered' where results are returned based solely on index resolution. The optional 'filter' flag indicates each result should additionally be examined to verify it matches the query before being returned. If your query includes a constraint that won't reliably resolve from indexes, this will ensure you get accurate results. Beware that the query execution may take longer to process, possibly a very long time if most results returned from indexes aren't truly matches."
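For non-experts, the difference can be shown with a toy model. This is not how MarkLogic resolves queries internally. It only assumes that index resolution yields a candidate list that may contain false positives, and that filtering re-checks each candidate. All names here are hypothetical.

```javascript
// indexCandidates: documents the indexes say might match.
// verify: a function that inspects one document and says whether it truly matches.
function runSearch(indexCandidates, verify, options = {}) {
  // Unfiltered (default): trust the indexes and return candidates as-is.
  if (!options.filter) return indexCandidates.slice();
  // Filtered: examine every candidate, which costs time per document.
  return indexCandidates.filter(verify);
}
```

For example, if the indexes only know which words a document contains, a phrase search for "quick brown" would list a document containing "brown and quick" as a candidate. The filtered search drops it, at the cost of reading each candidate.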

RFE: Search suggest

Search suggest is something to consider, in support of the query string service.

RFE: Add elementExists constructor

With custom query service there's a keyExists constructor for JSON. Seems like there should be equivalents for element or element-attributes in XML documents.

RFE: An xpath-based search constraint

Some structure-driven constraints are easier/shorter to express in XPath than in the string and structured query syntax we support right now.

Perhaps we should let the user specify XPath expressions to limit documents, something like an xpathQuery parameter. For JSON documents we could use JSONPath.

We'd want the XPath to be fairly expressive (i.e. supporting complex predicates).

Open question: Do we only support searchable expression xpaths? Do we allow filtering?

RFE: Support text documents

General internal consensus seems to be that text documents would be good to support.

Shell script for setup

Yes.

The HTML page I cooked up was an easy first step that provided a nice looking out-of-box experience. Should file an RFE for a command line driven version as well.

--Ryan

On Nov 7, 2011, at 9:22 AM, Eric Bloch wrote:

Can we get a shell script for setup? Even if all it does is call curl...shell script can take command line options for credentials...

-Eric

Doc insert with no string provided should not return 200

The following request returns 200, with no indication that the document insert has failed (and it does fail).

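xdmp:http-put("http://localhost:8004/xml/store",
  <options xmlns="xdmp:http">
    <data>{xdmp:quote("mydoc")}</data>
    <authentication>
      <username>user</username>
      <password>pass</password>
    </authentication>
  </options>
)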

Seems like it ought to return 400.
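A sketch of the kind of up-front check that would produce a 400, written in JavaScript for illustration only. The function name and the error code `corona:EXAMPLE-MISSING-BODY` are made up for this example and are not existing Corona codes. The sketch covers only the empty-body case, not a body that fails to parse as XML or JSON.

```javascript
// Validate the request body before attempting the insert, so that a
// missing document yields a client error instead of a silent 200.
function checkInsertBody(body) {
  if (typeof body !== "string" || body.trim() === "") {
    return {
      status: 400,
      error: {
        status: 400,
        code: "corona:EXAMPLE-MISSING-BODY", // hypothetical code
        message: "No document content was supplied in the request.",
      },
    };
  }
  return { status: 200 };
}
```

Rejecting the request before any storage code runs also avoids the coercion errors that surface when empty content reaches a function expecting a string.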

RFE: String query against properties

People may want to query against properties using the string query syntax.

Wonder if we could make properties just another aspect of a Place.

RFE: Include some standard XSLT templates

I'm thinking of things like Norm's DocBook stylesheets (http://wiki.docbook.org/DocBookXslStylesheetDocs). If there's an XSLT library that's generally useful, high quality, and with a friendly license, we could include them to make a more complete solution.

Could also include a stylesheet to build a new custom document based on a manifest (a Chris Welch request).

RFE: Include elapsed-time in results

Seems like it may help people to have an elapsed time in the responses. For example: It lets you compare different queries reliably without concern for network latency overhead. It lets you record historic performance to a log.

At minimum this would be good on search results, but anything where there's a structured response it seems useful. And cheap on the MarkLogic side.
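One possible shape for this, sketched in Node-flavored JavaScript rather than the server-side XQuery it would really be written in: wrap whatever produces the response and add the timing to the existing `meta` object. The field name `elapsedMs` and the function name `withElapsedTime` are assumptions for this example.

```javascript
// Run a query function and attach the time it took, in milliseconds,
// to the response's meta object.
function withElapsedTime(runQuery) {
  const started = process.hrtime.bigint();
  const response = runQuery();
  // hrtime.bigint() is in nanoseconds; convert to milliseconds.
  const elapsedMs = Number(process.hrtime.bigint() - started) / 1e6;
  return { ...response, meta: { ...response.meta, elapsedMs: elapsedMs } };
}
```

Because the clock starts and stops on the server, the reported figure excludes network latency, which is what makes it useful for comparing queries.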

RFE: State serialization and reloading

It'd be useful to be able to extract a summary of Corona's state (namespaces, places, ranges, etc) in a singular serialized format such as JSON, for archive purposes, and then later push that state back using the same format.

It's similar to what MarkLogic 5 provided for the full system configuration, but this Corona version should be much more minimal and only focused on things a "corona admin" should see and maintain and should be available to those with only the corona-admin role. The MarkLogic 5 feature is for real "database admins".
