marklogic-community / corona
Community REST API for MarkLogic
License: Other
In the previous version we were returning URIs only; now only metadata is returned:
Nuno-Jobs-MacBook-Pro:nuvem njob$ NODE_ENV=development node test/json/destroy.js
{ method: 'PUT',
headers: { 'content-type': 'application/json' },
uri: 'http://admin:admin@localhost:4380/json/store/i_bulk_custom_query',
body: '{"dino":false}' }
{ method: 'PUT',
headers: { 'content-type': 'application/json' },
uri: 'http://admin:admin@localhost:4380/json/store/are_bulk_custom_query',
body: '{"dino":false}' }
{ method: 'PUT',
headers: { 'content-type': 'application/json' },
uri: 'http://admin:admin@localhost:4380/json/store/dino_bulk_custom_query',
body: '{"dino":"RWAR"}' }
{ method: 'DELETE',
headers: { 'content-type': 'application/json' },
uri: 'http://admin:admin@localhost:4380/json/store?customquery=%7B%22equals%22%3A%7B%22key%22%3A%22dino%22%2C%22value%22%3A%22RWAR%22%7D%7D&bulkDelete=true' }
✗
bulk_custom_query
✗ ok
» expected {
deleted: 1,
numRemaining: 0,
uris: [ '/dino_bulk_custom_query' ]
},
got {
meta: { deleted: 1, numRemaining: 0 }
} (==) // destroy.js:46
✗ Broken » 1 broken (0.309s)
I recall this being asked for in internal review. Buxton just asked for it as well.
If people want to include the documents in the results, we have JSON extensions to handle that.
If you do a PUT to store with no body in the request, the request eventually times out because the code has errored out on the server. For example, sending a request to /store?uri=foo.xml with no body produces the following stack trace on the server. On the client the request just times out.
I have witnessed this type of poor error handling in other places as well, such as posting invalid fields to a search request. I think that the error handling in general should be reviewed.
Request Stack Trace
Current Expression:
/corona/lib/json.xqy: 156 map:map()
Global Variables
$analyzeString = xdmp:function(xs:QName("fn:analyze-string"))
$isSupported = fn:true()
Stack Trace:
/corona/lib/json.xqy: 156 json:object(("status", 500, "code", ...))
Local Variables
$keyValues = ("status", 500, "code", ...)
/corona/lib/common.xqy: 75 common:error("corona:INTERNAL-ERROR", "Invalid coercion (XDMP-AS: (err:XPTY0004) $content as xs:string ...", "json")
Local Variables
$exceptionCode = "corona:INTERNAL-ERROR"
$message = "Invalid coercion (XDMP-AS: (err:XPTY0004) $content as xs:string ..."
$outputFormat = "json"
$isA400 = ("corona:DUPLICATE-INDEX-NAME", "corona:DUPLICATE-PLACE-ITEM", "corona:REQUIRES-BULK-DELETE")
$isA500 = ("corona:UNSUPPORTED-METHOD", "corona:INTERNAL-ERROR")
$statusCode = 500
$set = ()
$add = ()
/corona/lib/common.xqy: 95 common:errorFromException(<error:error xsi:schemaLocation="http://marklogic.com/xdmp/error error.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:error="http://marklogic.com/xdmp/error"><error:code>XDMP-AS</error:code><error:name>err:XPTY0004</error:...</error:error>, "json")
Local Variables
$exception = <error:error xsi:schemaLocation="http://marklogic.com/xdmp/error error.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:error="http://marklogic.com/xdmp/error"><error:code>XDMP-AS</error:code><error:name>err:XPTY0004</error:...</error:error>
$outputFormat = "json"
/corona/store.xqy: 162 (no expression source available)
Local Variables
$requestMethod = "PUT"
$params = map:map(<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:map="http://marklogic.com/xdmp/map"><map:entry key="uri"><map:value xsi:type="xs:string">/books/a_an...</map:map>)
$uri = "/books/a_and_c.xml"
$txid = ()
$outputFormat = "json"
$errors = ()
$e = <error:error xsi:schemaLocation="http://marklogic.com/xdmp/error error.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:error="http://marklogic.com/xdmp/error"><error:code>XDMP-AS</error:code><error:name>err:XPTY0004</error:...</error:error>
The two formats should be as similar as possible while keeping complexity down.
We need a MarkLogic error handler to send errors from MarkLogic in either XML or JSON
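As a rough sketch of what "the same error in either format" could look like, the helper below renders one error as JSON or XML with the same three fields (status, code, message). The function names are hypothetical and this is client-side JavaScript for illustration only, not Corona's XQuery; the JSON shape and the corona:error element names are taken from the responses quoted in these issues.

```javascript
// Hypothetical helper: escape text so it is safe inside an XML element.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: serialize one error in the requested output format.
// Both formats carry exactly the same fields so clients can treat them alike.
function formatError(status, code, message, outputFormat) {
  if (outputFormat === "xml") {
    return (
      '<corona:error xmlns:corona="http://marklogic.com/corona">' +
      "<corona:status>" + status + "</corona:status>" +
      "<corona:code>" + escapeXml(code) + "</corona:code>" +
      "<corona:message>" + escapeXml(message) + "</corona:message>" +
      "</corona:error>"
    );
  }
  // Default to JSON.
  return JSON.stringify({ error: { status: status, code: code, message: message } });
}
```

Keeping one field list for both serializations is the simplest way to keep the two formats from drifting apart.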
When I hit the setup.xqy at http://localhost:8010/config/setup.xqy I get the following error. The framework otherwise seems to be functioning fine, indicating that I have the correct configuration for the url rewriter.
{"error":{"status":404,"code":"corona:ENDPOINT-NOT-FOUND","message":"Invalid endpoint. Check path and parameters for errors."}}
dfeldman@hp001 /cygdrive/c/Program Files/MarkLogic/Data/Logs
$ curl -X PUT 'http://admin:admin@localhost:5010/store?uri=/foo/bar2.xml?collection=c' --data ''
<corona:error xmlns:corona="http://marklogic.com/corona"><corona:status>500</corona:status><corona:code>corona:INTERNAL-ERROR</corona:code><corona:message>Invalid coercion (XDMP-AS: (err:XPTY0004) $contentType as xs:string -- Invalid coercion: () as xs:string)</corona:message></corona:error>
works fine without the ?collection=c param.
It'd be good for people to be able to facet on collections and uris, and use them like they'd use any other range index value. This lets people get the collections or URIs that match a query without touching the disk. Perhaps we should have people enable them like they do ranges?
The Corona setup page, accessed at /config/setup, has two links. The bottom link takes you to a page that does the Corona user setup. The top link goes to /config/setup.xqy and doesn't seem to do anything. Why is it a link?
We could support a few analytical functions against ranges:
Median and other percentiles
Mean average
Standard deviation
Sum
These will run much faster if you don't have to transmit the full batch of data from the server to client.
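For reference, here is a minimal sketch of the four aggregates over a plain array of numbers. It only pins down the definitions; it says nothing about how MarkLogic would compute them from range indexes. The function names are made up, the percentile uses the nearest-rank method, and the standard deviation is the population form (both are assumptions, since the request doesn't say which variant is wanted).

```javascript
// Sum of all values.
function sum(values) {
  return values.reduce((total, v) => total + v, 0);
}

// Arithmetic mean.
function mean(values) {
  return sum(values) / values.length;
}

// Nearest-rank percentile: p is 0..100; the median is percentile(values, 50).
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Population standard deviation (divides by N, not N - 1).
function standardDeviation(values) {
  const m = mean(values);
  const variance = sum(values.map((v) => (v - m) ** 2)) / values.length;
  return Math.sqrt(variance);
}
```

Each of these reduces a whole range to a single number, which is why computing them on the server saves so much transfer.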
MarkLogic 5 introduces complex polygon constraints, for things like donuts. That's something we could pretty easily allow a structured query to declare.
There are other geo primitives like "distance" that maybe should be exposed somehow. If you want to sort by distance, for example, that could be handy. I'm not sure how exactly to expose that, but it seems valuable.
As in
function updateUser(ghUser) {
  console.log("Trying to update github user " + ghUser.login);
  db.json.update_first(
    { github_login: ghUser.login })
    .increment('version', 1)
    .replace('github_login', ghUser.login)
    .replace('github', ghUser)
    .delete('linkedin.deletemeplease')
    .rename('meeutp','meetup')
    .push('anarray', 1)
    .addCollection('github') // also replaceCollection
    .setQuality(9)
    .replacePermissions(['a','b','c']) // also addPermissions
    .save(
      function saveCb(e) {
        if(e) {
          console.log("Couldn't update " + ghUser.login);
          return;
        }
        console.log(ghUser.login + " updated");
        return;
      });
}
Request
http://localhost:4380/json/query?q=fox
Response
{"meta":{"start":1,"end":1,"total":2},"results":[{"uri":"/foo/another_snow","content":{"foo":"to find something you could eat fox"}}]}
Question: If I didn't specify start and end, shouldn't this return both results that match?
A pretty common thing is to just get a count of docs matching a query. We should make sure that's possible with the various query interfaces, and possible in a way that regular folks will figure out.
At some point Corona should support reverse queries.
I picture letting the user save queries with names. Then execute a reverse query on each document insert or modification (easy to do when in a managed context, could still use triggers and CPF when not). Let the user indicate what should happen on a new match. Expose some decent primitives for them to choose from: invoke a url, email out, make note in a way that can be polled later, run some uploaded code, etc.
If someone searches for several words but doesn't use quotes, we should be able to boost results that have those words together as a phrase or in close proximity. Maybe as a configurable setting.
It can be implemented with an or-query combination of phrase queries and near queries as well as separate word queries, and ML has some built-in boosting support as well.
Using cts:query people sometimes write cts:and-query(()) to indicate the equivalent of the non-existent cts:true-query() and cts:or-query(()) for the non-existent cts:false-query(). I think Corona should provide the true/false query primitives so this is more approachable to people.
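To see why the empty and/or queries behave like true and false, here is a toy evaluator in JavaScript. It is not cts:query and the names are made up; a "query" here is just a function from a document to a boolean.

```javascript
// An and-query matches when every subquery matches.
function andQuery(subqueries) {
  return (doc) => subqueries.every((q) => q(doc));
}

// An or-query matches when at least one subquery matches.
function orQuery(subqueries) {
  return (doc) => subqueries.some((q) => q(doc));
}

// With no subqueries, "every" is vacuously true and "some" is false,
// which is exactly the trick people use with cts:and-query(()) and cts:or-query(()).
const trueQuery = andQuery([]);
const falseQuery = orQuery([]);
```

Named true/false primitives would give the same behavior without requiring people to know this identity.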
The structured query construct supports declaring geographic queries, but there's no way via Corona to declare the geo indexes that need to be there to support them.
http://developer.marklogic.com/pubs/5.0/apidocs/cts-query.html#cts:element-geospatial-query says
"$weight (optional): A weight for this query. The default is 1.0. This option is currently ignored; geospatial queries do not contribute to the score."
So I'd remove that feature from Corona until such time as we can implement it. It's OK with me if you keep the code the same, but remove it from the docs.
Some structured queries will be very large. Should be able to POST them.
In the docs we should make clear you post the params, one of which is JSON or XML, and you don't post JSON or XML as its own request body.
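Something like the following could go in the docs to show the difference. It builds a form-encoded POST body in which the query is one parameter among others, rather than the raw request body. The parameter name structuredQuery and the helper name are assumptions for illustration; substitute whatever name the endpoint actually uses.

```javascript
// Hypothetical helper: build an application/x-www-form-urlencoded body
// where the JSON query travels as the value of a single parameter.
function buildQueryPostBody(structuredQuery, extraParams) {
  const params = new URLSearchParams(extraParams || {});
  // "structuredQuery" is an assumed parameter name, not confirmed by the docs.
  params.set("structuredQuery", JSON.stringify(structuredQuery));
  return params.toString();
}

// Example: the query is serialized and percent-encoded inside the form body.
const body = buildQueryPostBody(
  { equals: { key: "dino", value: "RWAR" } },
  { start: "1" }
);
```

The request would then be sent with a content-type of application/x-www-form-urlencoded, not application/json.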
An investigation topic.
When writing a document Corona should set it so that the current user can read/update/insert/execute the file. Right now Corona can't see the files it's written unless running as admin.
Yeah, probably even execute so you can use Corona to manage your module code.
I'm thinking current user with all their roles is better than just the role corona-dev.
The docs for namespaces, places, ranges, buckets, and transformers don't explain how to get a list of current ones.
I wonder how WebSockets would be supported with Corona. Any ideas?
It would be good to support storing binary documents.
It would also be good to optionally extract metadata and text content from them and make it available for query.
Today the date parser understands many different styles, in 5 languages (and tested against millions of MarkMail messages). We should probably support all the languages where MarkLogic has enhanced support.
It should be fairly easy to generate the match strings. Write a Java program and use its SimpleDateFormat class to print dates in all the different styles, then adjust the locale to get examples for all the necessary languages.
Today control over index features like case sensitivity is only available to database admins. That probably should be available to Corona admins via web service endpoints.
I think we should probably let Corona admins set the same options as MarkLogic fields let you set:
stemmed searches (combo)
word searches
fast phrase searches
fast case sensitive searches
fast diacritic sensitive searches
trailing wildcard searches
trailing wildcard word positions
three character searches
three character word positions
two character searches
one character searches
Then on Place definitions, you can optionally set these same ones as overrides. Handy.
To secure access to Corona and make sure casual developers don't mess with the internals of the Corona managed context, I envision having three roles: corona-dev, corona-admin, and corona-internal.
Web endpoint access will require corona-dev (for regular endpoints) or corona-admin (for the management endpoints). The corona-admin role will inherit corona-dev.
Document access will require corona-internal, which is a role no actual users should have but which the internal Corona code amps itself to have during document storage and retrieval calls. This keeps regular users, even corona-dev users, from directly accessing the files managed by Corona without going through Corona's "business logic".
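Here is a small sketch of the endpoint side of that model, just to make the inheritance rule concrete. The three role names come from the proposal above; the function names, the "regular"/"management" labels, and the table layout are made up, and real enforcement would happen through MarkLogic's security model rather than in JavaScript.

```javascript
// Which roles each role inherits. corona-admin inherits corona-dev;
// corona-internal is inherited by nobody, so no real user ends up with it.
const INHERITS = {
  "corona-admin": ["corona-dev"],
  "corona-dev": [],
  "corona-internal": [],
};

// Expand a user's assigned roles to include everything inherited.
function effectiveRoles(assignedRoles) {
  const seen = new Set();
  const pending = [...assignedRoles];
  while (pending.length > 0) {
    const role = pending.pop();
    if (seen.has(role)) continue;
    seen.add(role);
    pending.push(...(INHERITS[role] || []));
  }
  return seen;
}

// Role required for each kind of endpoint (labels are hypothetical).
const REQUIRED_ROLE = { regular: "corona-dev", management: "corona-admin" };

function canCallEndpoint(assignedRoles, endpointKind) {
  const required = REQUIRED_ROLE[endpointKind];
  return required !== undefined && effectiveRoles(assignedRoles).has(required);
}
```

Note that neither corona-dev nor corona-admin ever expands to corona-internal, which is the property that keeps users from reading the managed files directly.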
Is that good? Well, running a managed context has downsides, true, but bigger upsides I think. It lets us do reliable metadata tracking (is a saved file considered XML or JSON?), auditing, quota enforcement, implicit consistent hashing distribution, and so on. The list is pretty long. It also means we can keep regular users from seeing JSON files in their raw XML serialization and wrongly issuing XPaths against a format that might change.
Since users should have XQuery-level access to the docs managed by Corona, they'll need XQuery-level APIs. There'll be a corona:doc("foo.json") for example. This knows to fetch the file back as JSON. Internally it's calls like this that will amp to the corona-internal role to allow it to see the raw XML.
This would be convenient for people doing bulk uploads.
It will also let us do a fair comparison of performance between the Corona vs XDBC protocols.
For performance we should probably group documents into small batches (~32 docs), using multi-statement transactions and/or multiple documents uploaded together.
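The batching itself is simple; a sketch is below. The function name is made up and the default of 32 just mirrors the rough figure above. How each batch is then sent (one multi-statement transaction, or one multi-document upload) is left open.

```javascript
// Split a list of documents into consecutive batches of at most batchSize.
function toBatches(docs, batchSize = 32) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    batches.push(docs.slice(i, i + batchSize));
  }
  return batches;
}
```

The last batch may be smaller than the others, so whatever consumes the batches shouldn't assume a fixed size.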
Request from Clark. He'd like a way to clear the full Corona state to better support unit testing.
Perhaps you should be able to clear each managed item (places, namespaces, etc) as well as clear them all.
We should probably support clearing the database too. (What about non-Corona data in the database?)
Background: Normally MarkLogic does its own assignment of documents to forests and sends messages to all D-nodes when doing a document retrieval. Forest placement is the act of telling MarkLogic explicitly in which forest to place a document. In-forest eval is the act of limiting a query to a particular forest (or forests), which is commonly done when that forest is known to contain the document being retrieved due to previous forest placement. On large clusters these techniques can help with scaling. Documents are often assigned to forests using a consistent hashing algorithm on URIs.
The challenge of consistent hashing is handling the case when the topology of the forests changes (i.e. when a new forest is added). But by adding a level of indirection (essentially hashing documents into buckets and tracking which buckets are assigned to which forests) you can handle a new topology by moving bucket assignments to new D-nodes and maintaining a memory of which buckets are where. Hash -> bucket -> forest.
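A minimal sketch of the hash -> bucket -> forest indirection follows. Everything here is illustrative: the class and method names are made up, the hash is FNV-1a over the URI's UTF-16 code units (any stable hash would do), and the initial round-robin assignment is just one possible policy.

```javascript
// 32-bit FNV-1a over the URI's UTF-16 code units. Stable across runs.
function hashUri(uri) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < uri.length; i++) {
    hash ^= uri.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

class BucketRouter {
  constructor(bucketCount, forests) {
    this.bucketCount = bucketCount;
    // Initial assignment: spread buckets over forests round-robin.
    this.bucketToForest = [];
    for (let b = 0; b < bucketCount; b++) {
      this.bucketToForest.push(forests[b % forests.length]);
    }
  }

  // URI -> bucket never changes, no matter how many forests exist.
  bucketFor(uri) {
    return hashUri(uri) % this.bucketCount;
  }

  // Bucket -> forest is a lookup in the remembered assignment table.
  forestFor(uri) {
    return this.bucketToForest[this.bucketFor(uri)];
  }

  // Topology change: reassign one bucket (its documents would be moved too).
  moveBucket(bucket, forest) {
    this.bucketToForest[bucket] = forest;
  }
}
```

Because only the bucket table changes when a forest is added, rebalancing means moving whole buckets rather than rehashing every document.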
You can implement this in pure XQuery. It won't be invisible to the user though because the XQuery programmer will need to use custom store and retrieve calls that are hash-aware. Moreover, when loading from XCC the client needs to know about buckets, which is inelegant.
With Corona we can do it all effectively and invisibly. All doc stores would go through the hash -> bucket -> forest assignment. Doc retrievals also. Moreover, Corona could also do background rebalancing as new forests are added by moving buckets of documents to the new forests. That could be done automatically or via a web call.
Being a fully managed context has its advantages.
An issue raised by Fernando. He'd like to be able to request a set of documents at once by URI, and get the results back in the same order as the URIs were provided.
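The ordering part can be sketched as below: index whatever comes back by URI, then walk the requested list. The function name and the { uri, content } document shape are assumptions (the shape follows the query response quoted earlier), and returning null for a missing document is just one possible choice.

```javascript
// Return fetched documents in the same order as the requested URIs.
// URIs with no matching document yield null so positions still line up.
function orderByRequestedUris(requestedUris, fetchedDocs) {
  const byUri = new Map(fetchedDocs.map((doc) => [doc.uri, doc]));
  return requestedUris.map((uri) => (byUri.has(uri) ? byUri.get(uri) : null));
}
```

Keeping a placeholder for missing documents lets the caller match results to requests by position alone.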
Open up 2011, January - you will notice two horizontal lines below January - probably should be only 1.
People are going to pretty quickly want to talk to Corona directly from browsers. But that's not really safe. A malicious browser gets full access to the Corona data. We should think about ways to make that architecture safe.
Examples are the parse.com REST APIs and security model.
Any reason why we don't return confidence and fitness for each result? Seems like it could help people improve their queries.
We should give developers the ability to do filtered or unfiltered searches. The default should probably be unfiltered, but when you need filtering you really need it. We just need to be able to explain it to non-experts. Here's a stab at that:
"The default search behavior is 'unfiltered' where results are returned based solely on index resolution. The optional 'filter' flag indicates each result should additionally be examined to verify it matches the query before being returned. If your query includes a constraint that won't reliably resolve from indexes, this will ensure you get accurate results. Beware that the query execution may take longer to process, possibly a very long time if most results returned from indexes aren't truly matches."
Search suggest is something to consider, in support of the query string service.
With custom query service there's a keyExists constructor for JSON. Seems like there should be equivalents for element or element-attributes in XML documents.
Some structure-driven constraints are easier/shorter to express in XPath than in the string and structured query syntax we support right now.
Perhaps we should let the user specify XPath expressions to limit documents, something like an xpathQuery parameter. For JSON documents we could use JSONPath.
We'd want the XPath to be fairly expressive (i.e. supporting complex predicates).
Open question: Do we only support searchable expression xpaths? Do we allow filtering?
General internal consensus seems to be that text documents would be good to support.
Yes.
The HTML page I cooked up was an easy first step that provided a nice looking out-of-box experience. Should file an RFE for a command line driven version as well.
--Ryan
On Nov 7, 2011, at 9:22 AM, Eric Bloch wrote:
Can we get a shell script for setup? Even if all it does is call curl...shell script can take command line options for credentials...
-Eric
The following request returns 200, with no indication that the document insert has failed (and it does fail).
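xdmp:http-put("http://localhost:8004/xml/store",
<options xmlns="xdmp:http">
<data>{xdmp:quote("mydoc")}</data>
<authentication>
<username>user</username>
<password>pass</password>
</authentication>
</options>
)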
Seems like it ought to return 400.
People may want to query against properties using the string query syntax.
Wonder if we could make properties just another aspect of a Place.
I'm thinking of things like Norm's DocBook stylesheets (http://wiki.docbook.org/DocBookXslStylesheetDocs). If there's an XSLT library that's generally useful, high quality, and with a friendly license, we could include it to make a more complete solution.
Could also include a stylesheet to build a new custom document based on a manifest (a Chris Welch request).
Seems like it may help people to have an elapsed time in the responses. For example: It lets you compare different queries reliably without concern for network latency overhead. It lets you record historic performance to a log.
At minimum this would be good on search results, but anything where there's a structured response it seems useful. And cheap on the MarkLogic side.
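As an illustration of the idea (in Node rather than on the MarkLogic side), the wrapper below times a handler and adds the result to the response's meta object. The field name elapsedMs and the wrapper name are made up; the meta object follows the shape of the search response quoted earlier.

```javascript
// Wrap a synchronous handler so its response reports server-side elapsed time.
function withElapsed(handler) {
  return function (...args) {
    const startedAt = process.hrtime.bigint(); // monotonic clock, nanoseconds
    const response = handler(...args);
    const elapsedMs = Number(process.hrtime.bigint() - startedAt) / 1e6;
    // Keep existing meta fields and add the timing next to them.
    return { ...response, meta: { ...(response.meta || {}), elapsedMs } };
  };
}

// Example handler standing in for a search endpoint.
const timedSearch = withElapsed(function search(q) {
  return { meta: { start: 1, end: 1, total: 2 }, results: [{ uri: "/foo/" + q }] };
});
```

Since the time is measured around the handler only, it excludes network latency, which is what makes it useful for comparing queries.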
It'd be useful to be able to extract a summary of Corona's state (namespaces, places, ranges, etc) in a singular serialized format such as JSON, for archive purposes, and then later push that state back using the same format.
It's similar to what MarkLogic 5 provided for the full system configuration, but this Corona version should be much more minimal and only focused on things a "corona admin" should see and maintain and should be available to those with only the corona-admin role. The MarkLogic 5 feature is for real "database admins".
Today you can do transforms using XSLT. It would be useful for the XQuery-minded among us to also allow XQuery.
If I want to rename a document it'd be a lot faster if I didn't have to pull all the data to the client and re-send it straight back to the server.
We removed the contentType= restriction from the /search endpoint because it seems like something people won't really need. If people do want it, we can re-add it, and this time it should probably be an option in structured query.