
amcat4's People

Contributors

damian0604, farzamfan, jbgruber, kasperwelbers, nruigrok, vanatteveldt

amcat4's Issues

API role management

api:index_list falsely promises:

    List index from this server. Returns a list of dicts containing name, role, and guest attributes

Also, create_index should allow setting guest_role, and there should be endpoints to query all project users and to add/modify users, e.g. GET/POST/PUT/DELETE /index/X/user(s)

Feature Request: Significant Text aggregation

Dear maintainers,

I currently have the following problem that I would like to address: in a set of documents (here: parliamentary speeches) I want to list the most important terms of all documents returned by a query (with respect to the overall index).

For instance, when querying for the term "Russland" I get a subset of parliamentary speeches. Out of those documents (possibly also filtered by date) I want to return the 20 most important terms mentioned alongside "Russland" (e.g. "NATO", "Krim", ...). In my Python prototype (with a bingy flask backend) I could achieve this, but I soon realized that it would run into scalability problems. I came up with two approaches that could fix this.

  1. Persist TF-IDF scores for each term of a single document like this
{
text: "Dear Ladies and gentlemen, [...] Russland ,... in der NATO",
terms: [ 
    {"term": "ladies", "tfidf": 0.01},
    ... ,
    {"term":"NATO", "tfidf": 0.2}
]}

and aggregate the result terms by summing up their values. This comes in handy, as I need the TF-IDF scores on a document level anyway. Again, I think this could lead to scalability issues, as I would need a join operation over many distinct values. While I could do this with map-reduce, I am not sure if Elasticsearch can do this. Therefore, I abandoned this approach.

  2. Elasticsearch provides a significant-text aggregation, which seems to suit this quite nicely. Is there already a way to pass such queries to Amcat4, or is an extension needed? Currently, I am struggling a bit with mapping Elasticsearch queries to Amcat4 queries, due to their slightly different JSON structures.

I would be super happy if somebody could help :)
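For reference, a minimal sketch of what the second approach could look like as a raw Elasticsearch request body. It only builds the JSON; how (or whether) Amcat4 would pass it through is exactly the open question. The function name, the "text" and "date" field names, and the sampler size are assumptions for illustration, not part of the Amcat4 API.

```python
import json


def significant_text_query(term, field="text", size=20, date_from=None, date_to=None):
    """Build a search body that ranks terms which are unusually frequent in the
    documents matching `term`, compared to the index as a whole."""
    must = [{"match": {field: term}}]
    if date_from or date_to:
        date_range = {}
        if date_from:
            date_range["gte"] = date_from
        if date_to:
            date_range["lte"] = date_to
        # Assumes the documents carry a "date" field
        must.append({"range": {"date": date_range}})
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {"bool": {"must": must}},
        "aggs": {
            "sample": {
                # significant_text re-analyses the source text at query time,
                # so a sampler limits how many documents per shard are inspected
                "sampler": {"shard_size": 200},
                "aggs": {
                    "keywords": {"significant_text": {"field": field, "size": size}}
                },
            }
        },
    }


body = significant_text_query("Russland", date_from="2014-01-01")
print(json.dumps(body, indent=2))
```

The body can be sent to the index's _search endpoint with any Elasticsearch client; the resulting buckets would be found under aggregations.sample.keywords.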

Get rid of SQLite db?

The SQLite db is used for index management and user management. Could we get rid of this by, e.g.:

  1. for index management, either assume all indices in elastic should be used by amcat, or enter some metadata in elastic to "tag" / "import" an index

  2. for users, use a separate table and/or elastic security and/or an external auth source

"onboarding" process

Currently the new user experience is a bit disappointing, requiring command line action to create an admin user and getting a 404 on the / endpoint.

It would be great to:

  • Store some basic config (where is my elastic? what authentication scheme do we use?) in a config file
  • Have some sort of onboarding to fill in the config details in the web browser if it doesn't exist
  • Have a somewhat informative / endpoint, at least to acknowledge that it exists and a link to a hosted client?

import / export functions

  • easily dump an index (as zipped json-lines with some meta? as csv?)
  • easily import a backup file with option to set fields
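A rough sketch of the "zipped json-lines with some meta" variant, using only the standard library. The file layout (first line is a metadata header, one document per following line) and the function names are assumptions made for illustration; nothing here talks to elastic.

```python
import gzip
import json


def dump_index(documents, path, meta=None):
    """Write documents as gzipped JSON lines; the first line holds index metadata."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(json.dumps({"meta": meta or {}}) + "\n")
        for doc in documents:
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")


def load_index(path):
    """Read a file written by dump_index and return (meta, documents)."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        meta = json.loads(next(f))["meta"]
        documents = [json.loads(line) for line in f if line.strip()]
    return meta, documents
```

Keeping the field definitions in the header line would let an import step offer the "option to set fields" before any document is uploaded.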

object fields can not be hidden

I thought at first that this was an amcat4client issue, but the browser shows me an internal server error. In the amcat logs the error is:

elasticsearch.BadRequestError: BadRequestError(400, 'mapper_parsing_exception', 'Mapping definition for [dfm] has unsupported parameters:  [meta : {amcat4_display_meta=0}]')

steps to reproduce:

  1. run the actioncat example action: https://github.com/ccs-amsterdam/actioncat/tree/main#r-action-example-tidy-document-features
  2. go to the web interface and try to hide the dfm field (Fields -> Show in article)

number-like strings in numeric fields are not coerced, which gives problems

Uploading an actual string to a numeric field gives an error, but uploading a number-like string to a numeric field somehow does not coerce it to numeric:

Uploading only a 'numeric' string gives a character result:

> create_index("test2")
> set_fields("test2", list("i"="double"))
> upload_documents("test2", data.frame(date="2000-01-01", title="bla", text="bla", i="3"))
> query_documents("test2", fields=list("i"))
Retrieved 1 results in 1 pages
# A tibble: 1 × 2
  .id                                                      i    
  <chr>                                                    <chr>
1 78de38a6983e4043628d8ada4eb6ba22b65f2b77cbb8f8ddbfd7f0cf 3    

Combining actual numbers and number-strings gives an R error from bind_rows:

> create_index("test2")
> set_fields("test2", list("i"="double"))
> upload_documents("test2", data.frame(date="2000-01-01", title="bla1", text="bla", i=1))
> upload_documents("test2", data.frame(date="2000-01-01", title="bla3", text="bla", i="3"))
> query_documents("test2", fields=list("i"))
Retrieved 1 results in 1 pages
# A tibble: 1 × 2
  .id                                                          i
  <chr>                                                    <int>
1 772dbabd82787a10e97ec6474d9243c0038bc3ca0758b248be5d9fd8     1
> query_documents("test2", fields=list("i"))
Error in `dplyr::bind_rows()`:
! Can't combine `..1$i` <integer> and `..2$i` <character>.
Run `rlang::last_error()` to see where the error occurred.
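One possible fix is to coerce values when documents are uploaded, before they are stored. The sketch below shows that idea in Python; the function name and the way numeric fields are passed in are assumptions, and this is not how amcat4 currently handles uploads.

```python
def coerce_numeric_fields(doc, numeric_fields):
    """Return a copy of doc in which number-like strings in numeric fields are floats."""
    coerced = dict(doc)
    for field in numeric_fields:
        value = coerced.get(field)
        if isinstance(value, str):
            try:
                coerced[field] = float(value)
            except ValueError:
                # Mirror the existing behaviour: real strings in numeric fields are an error
                raise ValueError(f"Field {field!r} expects a number, got {value!r}") from None
    return coerced


doc = {"date": "2000-01-01", "title": "bla", "text": "bla", "i": "3"}
print(coerce_numeric_fields(doc, ["i"]))
```

With the stored value being a real number, every row comes back with the same type and the client no longer has to combine integer and character columns.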

Fix highlighting in query results

Current system is a bit of a mess with

  • highlight param is either dict or bool
  • highlighting has an unclear relation with annotations

Solution:

  • drop all annotations
  • highlight as enum {none, highlights, snippets}, or highlight={type: [enum], ...}
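A small sketch of how the proposed parameter could be parsed, accepting either the bare enum value or the dict form. The class and function names are hypothetical, and any extra keys in the dict form (such as fragment_size below) are only passed through as illustrative options.

```python
from enum import Enum


class Highlight(str, Enum):
    none = "none"
    highlights = "highlights"
    snippets = "snippets"


def parse_highlight(value):
    """Accept a bare enum value or a dict like {"type": "snippets", ...}.

    Returns (Highlight, options); raises ValueError for unknown types.
    """
    if value is None:
        return Highlight.none, {}
    if isinstance(value, dict):
        options = dict(value)  # copy so the caller's dict is not modified
        kind = options.pop("type", "none")
        return Highlight(kind), options
    return Highlight(value), {}


print(parse_highlight("snippets"))
print(parse_highlight({"type": "highlights", "fragment_size": 50}))
```

Because a bool is not a valid enum value, a legacy highlight=True would be rejected with a ValueError instead of being silently interpreted.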

index and field descriptions / metadata

It might be nice to be able to give some more description of an index (description, source, author/owner, etc.), and also of fields (codebook, meaning, theoretical concepts, etc.). This information would probably need to be stored in the system index (or in a separate index, but that feels like overkill?)

Adding an admin email does not give that user access to all indexes

It seems like 026a8ca did not fix the issue that just setting the admin email through amcat4 config is not enough.

Reprex:

docker-compose up --pull="missing" -d
docker exec -it amcat4 amcat4 create-test-index
#> [INFO   :root           ] **** Creating test index state_of_the_union ****
#> [INFO   :root           ] Creating amcat4 system index: amcat4_system
#> Reading/writing settings from .env

I leave the defaults unchanged but set allow_guests and my email as admin:

docker exec -it amcat4 amcat4 config
#> host: Host this instance is served at (needed for checking tokens)
#> The current value for host is http://localhost/amcat.
#> Enter a new value, press [enter] to leave unchanged, or press [control+c] to abort: 
#> 
#> elastic_host: Elasticsearch host
#> The current value for elastic_host is elastic7:9200.
#> Enter a new value, press [enter] to leave unchanged, or press [control+c] to abort: 
#> 
#> auth: Do we require authorization?
#>   Possible choices:
#>   - no_auth: everyone (that can reach the server) can do anything they want
#>   - allow_guests: everyone can use the server, dependent on index-level guest_role authorization settings
#>   - allow_authenticated_guests: everyone can use the server, if they have a valid middlecat login,
#> and dependent on index-level guest_role authorization settings
#>   - authorized_users_only: only people with a valid middlecat login and an explicit server role can use the server
#> 
#> The current value for auth is AuthOptions.no_auth.
#> Enter a new value, press [enter] to leave unchanged, or press [control+c] to abort:  allow_guests
#> 
#> middlecat_url: Middlecat server to trust as ID provider
#> The current value for middlecat_url is https://middlecat.up.railway.app.
#> Enter a new value, press [enter] to leave unchanged, or press [control+c] to abort: 
#> 
#> admin_email: Email address for a hardcoded admin email (useful for setup and recovery)
#> The current value for admin_email is None.
#> Enter a new value, press [enter] to leave unchanged, or press [control+c] to abort: [email protected]
#> 
#> system_index: Elasticsearch index to store authorization information in
#> The current value for system_index is amcat4_system.
#> Enter a new value, press [enter] to leave unchanged, or press [control+c] to abort: 
#> *** Written .env file to .env ***
docker restart amcat4
library(amcat4r)
amcat_login(server = "http://localhost/amcat", force_refresh = TRUE)
list_indexes()
#> # A tibble: 0 × 0

Workaround

When specifically adding the email with add-admin, it works as intended:

docker exec -it amcat4 amcat4 add-admin [email protected]
library(amcat4r)
amcat_login(server = "http://localhost/amcat", force_refresh = TRUE)
list_indexes()
#> # A tibble: 1 × 1
#>   name              
#>   <chr>             
#> 1 state_of_the_union

Feature Request: Binary attachments for documents

Content analysis can often go beyond texts. Think of manual coding jobs where we would like to annotate content that also features images (e.g., instagram posts). Sound files or PDFs (OCR content) might also be candidates.

Elastic comes with the necessary features for this, and there is also a plugin available for ingestion. I guess it would be necessary to limit the file size to prevent users from accidentally uploading ridiculously large attachments.

This is not urgent, but I think worth considering.

tags as replacement for article sets

I think it would be good if tags can be used much the same way we used article sets, i.e. groupings of articles that you can easily use in queries and add articles to.

This is mostly a UI feature (providing special support for a special "tag" field or for all fields with type=tag). However, to make it easy to add articles to a tag we should allow some form of bulk modify, either via a body and list of IDs, or via some form of update by query (https://www.elastic.co/guide/en/elasticsearch/reference/master/docs-update-by-query.html)
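To make the update-by-query option concrete, here is a sketch of a request body that adds a tag to a list of document IDs. It only builds the JSON for elastic's _update_by_query endpoint. The function name and the "tags" field are assumptions, the script assumes the field holds a list, and it should be tried on a test index before relying on it.

```python
def add_tag_body(doc_ids, tag, field="tags"):
    """Build an _update_by_query body that appends `tag` to `field` for the given ids."""
    return {
        "query": {"ids": {"values": list(doc_ids)}},
        "script": {
            "lang": "painless",
            # Create the list if the field is missing, otherwise append without duplicates
            "source": (
                "if (ctx._source[params.field] == null) {"
                " ctx._source[params.field] = [params.tag];"
                " } else if (!ctx._source[params.field].contains(params.tag)) {"
                " ctx._source[params.field].add(params.tag);"
                " }"
            ),
            "params": {"field": field, "tag": tag},
        },
    }


print(add_tag_body(["id1", "id2"], "my_set"))
```

Swapping the ids query for an arbitrary query would give the "add everything matching this search to a tag" behaviour that article sets used to provide.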

amcat4 in dependency hell

Looks like amcat4 can't be installed at the moment. If I try to install it normally, an old version of pyyaml is downloaded and then fails to install due to a known issue. If I request a newer version, the fixed version of pydantic is in conflict. If I remove the version requirement, amcat4 can be installed, but does not work since pydantic v2 moved a lot of things around and with_attrs_docs does not exist anymore (resulting in ImportError: cannot import name 'with_attrs_docs' from 'pydantic_settings').

I tried to use pydantic's upgrade tool, but it does not seem to remove with_attrs_docs, keeping the original issue and adding some new ones (e.g., pydantic.errors.PydanticUserError: A non-annotated attribute was detected: system_index = 'amcat4_system').

I think using the new pydantic is the best way forward, but I do not understand what pydantic does well enough to go through the remaining issues atm.

Add rename/reindex endpoint

There is currently no way to change the name of an index once created. It would be nice to use elastic's reindex API to create a new endpoint. The API can also be used to combine multiple indexes, which could be really useful:

POST _reindex
{
  "source": {
    "index": ["twitter", "blog"]
  },
  "dest": {
    "index": "all_together"
  }
}

sqlite3.OperationalError: no such table: user on login

Hi,

I just installed amcat4 as well as amcat4client on a fresh server. However, when I try to log on via the JS client, nothing happens, and the server backend shows that a user table seems to be missing: sqlite3.OperationalError: no such table: user (full log below). I guess this table should be created automatically on the first run? (And is there a quick fix?)
Thanks!

damian@tux01ascor:/opt/amcat4$ /usr/local/bin/uwsgi --ini /opt/amcat4/amcat4-uswgi.ini
[uWSGI] getting INI configuration from /opt/amcat4/amcat4-uswgi.ini
*** Starting uWSGI 2.0.20 (64bit) on [Wed Oct 13 08:23:11 2021] ***
compiled with version: 9.3.0 on 13 October 2021 08:03:07
os: Linux-5.4.0-88-generic #99-Ubuntu SMP Thu Sep 23 17:29:00 UTC 2021
nodename: tux01ascor
machine: x86_64
clock source: unix
detected number of CPU cores: 24
current working directory: /opt/amcat4
detected binary path: /usr/local/bin/uwsgi
!!! no internal routing support, rebuild with pcre support !!!
your processes number limit is 513470
your memory page size is 4096 bytes
detected max file descriptor number: 1024
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to UNIX address /tmp/amcat4.socket fd 3
Python version: 3.8.10 (default, Sep 28 2021, 16:10:42)  [GCC 9.3.0]
*** Python threads support is disabled. You can enable it with --enable-threads ***
Python main interpreter initialized at 0x55d7c47b5900
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 437520 bytes (427 KB) for 5 cores
*** Operational MODE: preforking ***
WSGI app 0 (mountpoint='') ready in 0 seconds on interpreter 0x55d7c47b5900 pid: 8083 (default app)
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 8083)
spawned uWSGI worker 1 (pid: 8085, cores: 1)
spawned uWSGI worker 2 (pid: 8086, cores: 1)
spawned uWSGI worker 3 (pid: 8087, cores: 1)
spawned uWSGI worker 4 (pid: 8088, cores: 1)
spawned uWSGI worker 5 (pid: 8089, cores: 1)
ERROR:amcat4.api:Exception on /auth/token/ [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 3144, in execute_sql
    cursor.execute(sql, params or ())
sqlite3.OperationalError: no such table: user

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2073, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1518, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/dist-packages/flask_cors/extension.py", line 165, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/usr/local/lib/python3.8/dist-packages/flask_httpauth.py", line 389, in decorated
    return selected_auth.login_required(role=role,
  File "/usr/local/lib/python3.8/dist-packages/flask_httpauth.py", line 161, in decorated
    user = self.authenticate(auth, password)
  File "/usr/local/lib/python3.8/dist-packages/flask_httpauth.py", line 238, in authenticate
    return self.verify_password_callback(username, client_password)
  File "/opt/amcat4/./amcat4/api/common.py", line 21, in verify_password
    g.current_user = auth.verify_user(username, password)
  File "/opt/amcat4/./amcat4/auth.py", line 83, in verify_user
    user = User.get(User.email == email)
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 6438, in get
    return sq.get()
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 6884, in get
    return clone.execute(database)[0]
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 1907, in inner
    return method(self, database, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 1978, in execute
    return self._execute(database)
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 2150, in _execute
    cursor = database.execute(self)
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 3157, in execute
    return self.execute_sql(sql, params, commit=commit)
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 3151, in execute_sql
    self.commit()
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 2917, in __exit__
    reraise(new_type, new_type(exc_value, *exc_args), traceback)
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 190, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 3144, in execute_sql
    cursor.execute(sql, params or ())
peewee.OperationalError: no such table: user
[pid: 8089|app: 0|req: 1/1] 145.92.75.158 () {52 vars in 795 bytes} [Wed Oct 13 08:26:06 2021] GET /amcat4server/auth/token/ => generated 290 bytes in 14 msecs (HTTP/1.1 500) 3 headers in 131 bytes (1 switches on core 0)
ERROR:amcat4.api:Exception on /auth/token/ [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 3144, in execute_sql
    cursor.execute(sql, params or ())
sqlite3.OperationalError: no such table: user

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2073, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1518, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/dist-packages/flask_cors/extension.py", line 165, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/usr/local/lib/python3.8/dist-packages/flask_httpauth.py", line 389, in decorated
    return selected_auth.login_required(role=role,
  File "/usr/local/lib/python3.8/dist-packages/flask_httpauth.py", line 161, in decorated
    user = self.authenticate(auth, password)
  File "/usr/local/lib/python3.8/dist-packages/flask_httpauth.py", line 238, in authenticate
    return self.verify_password_callback(username, client_password)
  File "/opt/amcat4/./amcat4/api/common.py", line 21, in verify_password
    g.current_user = auth.verify_user(username, password)
  File "/opt/amcat4/./amcat4/auth.py", line 83, in verify_user
    user = User.get(User.email == email)
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 6438, in get
    return sq.get()
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 6884, in get
    return clone.execute(database)[0]
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 1907, in inner
    return method(self, database, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 1978, in execute
    return self._execute(database)
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 2150, in _execute
    cursor = database.execute(self)
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 3157, in execute
    return self.execute_sql(sql, params, commit=commit)
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 3151, in execute_sql
    self.commit()
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 2917, in __exit__
    reraise(new_type, new_type(exc_value, *exc_args), traceback)
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 190, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.8/dist-packages/peewee.py", line 3144, in execute_sql
    cursor.execute(sql, params or ())
peewee.OperationalError: no such table: user
[pid: 8089|app: 0|req: 2/2] 145.92.75.158 () {52 vars in 795 bytes} [Wed Oct 13 08:26:45 2021] GET /amcat4server/auth/token/ => generated 290 bytes in 3 msecs (HTTP/1.1 500) 3 headers in 131 bytes (1 switches on core 0)
c^CSIGINT/SIGTERM received...killing workers...
worker 1 buried after 1 seconds
worker 2 buried after 1 seconds
worker 3 buried after 1 seconds
worker 4 buried after 1 seconds
worker 5 buried after 1 seconds
goodbye to uWSGI.
VACUUM: unix socket /tmp/amcat4.socket removed.

negative dates apparently give issues

From a chat with @mrwunderbar666:

"it turns out that some articles had malformed datetime strings: "-0001-11-30T00:00:00" and AmCAT just accepted them"

But then on aggregating there's an error:

nov 29 18:20:50 amcat2 uvicorn[20445]:   File "/srv/amcat4opted/amcat4/aggregate.py"
nov 29 18:20:50 amcat2 uvicorn[20445]:     result = datetime.utcfromtimestamp(result
nov 29 18:20:50 amcat2 uvicorn[20445]: ValueError: year -1 is out of range
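One way to avoid this would be to reject such values at upload time rather than failing later in the aggregation. A minimal sketch, assuming dates arrive as ISO 8601 strings; the function name is hypothetical and this is not the current amcat4 behaviour.

```python
from datetime import datetime


def is_valid_date(value):
    """Return True if value is an ISO 8601 date(time) string Python's datetime can represent."""
    try:
        datetime.fromisoformat(value)
    except (TypeError, ValueError):
        # Covers malformed strings, non-strings, and years outside 1..9999
        return False
    return True


for candidate in ["2021-01-01", "2000-01-01T00:00:00", "-0001-11-30T00:00:00"]:
    print(candidate, is_valid_date(candidate))
```

Checking against Python's own datetime range has the advantage that anything accepted here can later be converted back without the out-of-range error shown above.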

Uploading Documents with array of `object` as property returns KeyError 'type'

Dear Maintainers,

When storing documents with a property of type 'object' I ran into an error.

Steps to reproduce:

amcat.create_index("speeches_ger4")
# does not change outcome:
# amcat.set_fields("speeches_ger4", {"term_tfidf": "object"})      
a = {
    "title": "Hallo du", 
    "date": "2021-01-01", 
    "text": "Hallo du, wie geht es dir?", 
    "term_tfidf": [
        {"term": "hallo",  "value": 0.2},
         {"term": "du", "value": 0.3}
     ]
}
amcat.upload_documents("speeches_ger4", [a])

# Also amcat.query not working
amcat.get_fields("speeches_ger4")

This returns the following error (on the amcat4 instance, the client just returns HTTP 500).


amcat4        | INFO:     172.26.0.5:34852 - "GET /index/speeches_ger4/fields HTTP/1.0" 500 Internal Server Error
amcat4        | ERROR:    Exception in ASGI application
amcat4        | Traceback (most recent call last):
amcat4        |   File "/usr/local/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
amcat4        |     result = await app(  # type: ignore[func-returns-value]
amcat4        |   File "/usr/local/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
amcat4        |     return await self.app(scope, receive, send)
amcat4        |   File "/usr/local/lib/python3.10/site-packages/fastapi/applications.py", line 271, in __call__
amcat4        |     await super().__call__(scope, receive, send)
amcat4        |   File "/usr/local/lib/python3.10/site-packages/starlette/applications.py", line 118, in __call__
amcat4        |     await self.middleware_stack(scope, receive, send)
amcat4        |   File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
amcat4        |     raise exc
amcat4        |   File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
amcat4        |     await self.app(scope, receive, _send)
amcat4        |   File "/usr/local/lib/python3.10/site-packages/starlette/middleware/cors.py", line 84, in __call__
amcat4        |     await self.app(scope, receive, send)
amcat4        |   File "/usr/local/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
amcat4        |     raise exc
amcat4        |   File "/usr/local/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
amcat4        |     await self.app(scope, receive, sender)
amcat4        |   File "/usr/local/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
amcat4        |     raise e
amcat4        |   File "/usr/local/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
amcat4        |     await self.app(scope, receive, send)
amcat4        |   File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 706, in __call__
amcat4        |     await route.handle(scope, receive, send)
amcat4        |   File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
amcat4        |     await self.app(scope, receive, send)
amcat4        |   File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
amcat4        |     response = await func(request)
amcat4        |   File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 237, in app
amcat4        |     raw_response = await run_endpoint_function(
amcat4        |   File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 165, in run_endpoint_function
amcat4        |     return await run_in_threadpool(dependant.call, **values)
amcat4        |   File "/usr/local/lib/python3.10/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
amcat4        |     return await anyio.to_thread.run_sync(func, *args)
amcat4        |   File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
amcat4        |     return await get_asynclib().run_sync_in_worker_thread(
amcat4        |   File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
amcat4        |     return await future
amcat4        |   File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
amcat4        |     result = context.run(func, *args)
amcat4        |   File "/usr/local/lib/python3.10/site-packages/amcat4/api/index.py", line 182, in get_fields
amcat4        |     return elastic.get_fields(indices)
amcat4        |   File "/usr/local/lib/python3.10/site-packages/amcat4/elastic.py", line 252, in get_fields
amcat4        |     for f, ftype in get_index_fields(ix).items():
amcat4        |   File "/usr/local/lib/python3.10/site-packages/amcat4/elastic.py", line 236, in get_index_fields
amcat4        |     result = dict(_get_fields(index))
amcat4        |   File "/usr/local/lib/python3.10/site-packages/amcat4/elastic.py", line 211, in _get_fields
amcat4        |     t = dict(name=k, type=_get_type_from_property(v))
amcat4        |   File "/usr/local/lib/python3.10/site-packages/amcat4/elastic.py", line 205, in _get_type_from_property
amcat4        |     return properties['type']
amcat4        | KeyError: 'type'

When I query the elasticsearch instance with postman directly, I can see that there is no type property. I suspect that 'object' is the default type and therefore it is not being returned.

Here is the output of
GET http://localhost:9200/speeches_ger4

{
    "speeches_ger4": {
        "aliases": {},
        "mappings": {
            "properties": {
                "date": {
                    "type": "date",
                    "format": "strict_date_optional_time"
                },
                "term_tfidf": {
                    "properties": {
                        "term": {
                            "type": "text",
                            "fields": {
                                "keyword": {
                                    "type": "keyword",
                                    "ignore_above": 256
                                }
                            }
                        },
                        "value": {
                            "type": "float"
                        }
                    }
                },
                "text": {
                    "type": "text"
                },
                "title": {
                    "type": "text"
                },
                "url": {
                    "type": "keyword",
                    "meta": {
                        "amcat4_type": "url"
                    }
                }
            }
        },
        "settings": {
            "index": {
                "routing": {
                    "allocation": {
                        "include": {
                            "_tier_preference": "data_content"
                        }
                    }
                },
                "number_of_shards": "1",
                "provided_name": "speeches_ger4",
                "creation_date": "1679334914957",
                "number_of_replicas": "1",
                "uuid": "u6kfwZbtSQSrMGJ9wz7Umw",
                "version": {
                    "created": "7170999"
                }
            }
        }
    }
}

As the document is persisted accurately in elasticsearch, I suspect that this is an amcat4 problem.

Edit: I found a workaround by storing a JSON inside a field and deserializing it on the client
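Based on the mapping above, a fix could be as small as falling back to "object" when a property has nested properties but no explicit type. The sketch below is a guess at the shape of such a helper, not the actual amcat4 code; the function name and the order of the checks are assumptions.

```python
def get_type_from_property(properties):
    """Derive a field type from one entry of an elasticsearch mapping."""
    # Prefer the amcat4-specific type stored in the mapping's meta, if present
    meta_type = properties.get("meta", {}).get("amcat4_type")
    if meta_type:
        return meta_type
    if "type" in properties:
        return properties["type"]
    # Elasticsearch omits "type" for object fields and only lists sub-properties
    if "properties" in properties:
        return "object"
    raise ValueError(f"Cannot determine field type from mapping: {properties!r}")


mapping = {
    "date": {"type": "date", "format": "strict_date_optional_time"},
    "term_tfidf": {"properties": {"term": {"type": "text"}, "value": {"type": "float"}}},
    "url": {"type": "keyword", "meta": {"amcat4_type": "url"}},
}
print({name: get_type_from_property(prop) for name, prop in mapping.items()})
```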

Timeout error

I sometimes get a timeout error and the upload stops when I use the Amcat4 client. I don't know what the reason is, because it happens quite randomly (e.g. depending on chunk_size after 13% or 20% of my documents). I think one could fix this in the Amcat-client by simply re-trying when ConnectionTimeout occurs, but unfortunately AmCAT just returns a 500 instead of forwarding the 408 from Elastic.

The Timeout error occurs only for some document collections, while it works for others. I assume that some collections have larger documents and that after 10k docs ES has to perform some garbage collection or indexing, which leads to a Timeout. I could replicate this on multiple computers with different specs, so I do not think it is a performance issue on my local machine.
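The client-side retry I have in mind would look roughly like this. It is a generic sketch: upload_chunk stands in for whatever function uploads one chunk, and the names, the retry count, and the backoff schedule are assumptions rather than anything in the AmCAT client today.

```python
import time


def upload_with_retry(upload_chunk, chunks, max_attempts=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Upload chunks one by one, retrying a failed chunk with exponential backoff."""
    for chunk in chunks:
        for attempt in range(1, max_attempts + 1):
            try:
                upload_chunk(chunk)
                break  # this chunk succeeded, continue with the next one
            except retry_on:
                if attempt == max_attempts:
                    raise  # give up and surface the last error
                # Wait 1x, 2x, 4x, ... base_delay before trying the same chunk again
                sleep(base_delay * 2 ** (attempt - 1))
```

As long as the server answers every failure with a generic 500, the client cannot tell a timeout from a real error, so retry_on would have to be broad; if the 408 were forwarded, it could be narrowed to timeouts only.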

Here are the logs from the client

AMCAT-Client
Traceback (most recent call last):
  File "/home/peter/anaconda3/envs/opted/lib/python3.9/site-packages/amcat4py/amcatclient.py", line 90, in _request
    r.raise_for_status()
  File "/home/peter/anaconda3/envs/opted/lib/python3.9/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://localhost/amcat/index/speeches_cz/documents

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/peter/uni/ParLawSpeechDashboard/preprocess_upload/preprocess.py", line 137, in <module>
    preprocess_and_upload()
  File "/home/peter/anaconda3/envs/opted/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/peter/anaconda3/envs/opted/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/peter/anaconda3/envs/opted/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/peter/anaconda3/envs/opted/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/peter/uni/ParLawSpeechDashboard/preprocess_upload/preprocess.py", line 133, in preprocess_and_upload
    amcat.upload_documents(index_name, speeches_tftidf, chunk_size=10000)
  File "/home/peter/anaconda3/envs/opted/lib/python3.9/site-packages/amcat4py/amcatclient.py", line 334, in upload_documents
    self._post("documents", index=index, json=body)
  File "/home/peter/anaconda3/envs/opted/lib/python3.9/site-packages/amcat4py/amcatclient.py", line 106, in _post
    return self._request("post", url=self._url(url, index), data=data, headers=headers, ignore_status=ignore_status)
  File "/home/peter/anaconda3/envs/opted/lib/python3.9/site-packages/amcat4py/amcatclient.py", line 92, in _request
    raise AmcatError(e.response, e.request) from e
amcat4py.amcatclient.AmcatError: Error from server (500): Internal Server Error
 15%|███████████████████████                                

And here from the AmCAT instance


2023-06-11 22:21:25 WARNING:elastic_transport.node_pool:Node <Urllib3HttpNode(http://elastic7:9200)> has failed for 1 times in a row, putting on 1 second timeout
2023-06-11 22:21:25 INFO:     172.26.0.5:57596 - "POST /index/speeches_cz/documents HTTP/1.0" 500 Internal Server Error
2023-06-11 22:21:25 ERROR:    Exception in ASGI application
2023-06-11 22:21:25 Traceback (most recent call last):
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 435, in run_asgi
2023-06-11 22:21:25     result = await app(  # type: ignore[func-returns-value]
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
2023-06-11 22:21:25     return await self.app(scope, receive, send)
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/fastapi/applications.py", line 276, in __call__
2023-06-11 22:21:25     await super().__call__(scope, receive, send)
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
2023-06-11 22:21:25     await self.middleware_stack(scope, receive, send)
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
2023-06-11 22:21:25     raise exc
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
2023-06-11 22:21:25     await self.app(scope, receive, _send)
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/starlette/middleware/cors.py", line 84, in __call__
2023-06-11 22:21:25     await self.app(scope, receive, send)
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
2023-06-11 22:21:25     raise exc
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
2023-06-11 22:21:25     await self.app(scope, receive, sender)
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
2023-06-11 22:21:25     raise e
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
2023-06-11 22:21:25     await self.app(scope, receive, send)
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
2023-06-11 22:21:25     await route.handle(scope, receive, send)
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
2023-06-11 22:21:25     await self.app(scope, receive, send)
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
2023-06-11 22:21:25     response = await func(request)
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 237, in app
2023-06-11 22:21:25     raw_response = await run_endpoint_function(
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 165, in run_endpoint_function
2023-06-11 22:21:25     return await run_in_threadpool(dependant.call, **values)
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
2023-06-11 22:21:25     return await anyio.to_thread.run_sync(func, *args)
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
2023-06-11 22:21:25     return await get_asynclib().run_sync_in_worker_thread(
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
2023-06-11 22:21:25     return await future
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
2023-06-11 22:21:25     result = context.run(func, *args)
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/amcat4/api/index.py", line 155, in upload_documents
2023-06-11 22:21:25     return elastic.upload_documents(ix, documents, columns)
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/amcat4/elastic.py", line 179, in upload_documents
2023-06-11 22:21:25     bulk(es(), actions)
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/elasticsearch/helpers/actions.py", line 521, in bulk
2023-06-11 22:21:25     for ok, item in streaming_bulk(
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/elasticsearch/helpers/actions.py", line 436, in streaming_bulk
2023-06-11 22:21:25     for data, (ok, info) in zip(
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/elasticsearch/helpers/actions.py", line 339, in _process_bulk_chunk
2023-06-11 22:21:25     resp = client.bulk(*args, operations=bulk_actions, **kwargs)  # type: ignore[arg-type]
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/elasticsearch/_sync/client/utils.py", line 414, in wrapped
2023-06-11 22:21:25     return api(*args, **kwargs)
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/elasticsearch/_sync/client/__init__.py", line 702, in bulk
2023-06-11 22:21:25     return self.perform_request(  # type: ignore[return-value]
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/elasticsearch/_sync/client/_base.py", line 285, in perform_request
2023-06-11 22:21:25     meta, resp_body = self.transport.perform_request(
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/elastic_transport/_transport.py", line 329, in perform_request
2023-06-11 22:21:25     meta, raw_data = node.perform_request(
2023-06-11 22:21:25   File "/usr/local/lib/python3.10/site-packages/elastic_transport/_node/_http_urllib3.py", line 199, in perform_request
2023-06-11 22:21:25     raise err from None
2023-06-11 22:21:25 elastic_transport.ConnectionTimeout: Connection timed out

deduplication

The hashing was removed in the last refactor, but we decided that it would make sense to reintroduce id-as-hash whenever users do not supply their own _id

e0efef6
e76f8e5
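
A minimal sketch of what "id-as-hash whenever users do not supply their own _id" could look like. This is not the code from the commits above; the helper name `with_hash_id`, the choice of SHA-256, and the canonical JSON serialization are all assumptions made for illustration.

```python
import hashlib
import json


def with_hash_id(document: dict) -> dict:
    """Return a copy of `document` that is guaranteed to carry an `_id`.

    Hypothetical helper: a user-supplied `_id` is kept as-is; otherwise the
    `_id` is derived from a hash of the document content.
    """
    if "_id" in document:
        return dict(document)
    # Sort keys so that field order does not influence the hash;
    # default=str covers values such as dates that JSON cannot encode natively.
    canonical = json.dumps(document, sort_keys=True, ensure_ascii=False, default=str)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {**document, "_id": digest}


if __name__ == "__main__":
    first = with_hash_id({"title": "Speech", "text": "Dear colleagues"})
    again = with_hash_id({"text": "Dear colleagues", "title": "Speech"})
    # Uploading the same content twice yields the same _id, so the second
    # upload would overwrite the first instead of creating a duplicate.
    print(first["_id"] == again["_id"])
```

With content-derived IDs, re-uploading an identical document maps onto the existing entry, which is the deduplication effect described in this issue.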

mutability and hash as id

Some (or all?) article/document fields should be considered immutable, e.g. [date, title, url, text], and we can then use a hash of these fields as the ID. This makes it easier to move articles between servers, since they keep their ID, and it guarantees that an article with a given ID has not been changed (good for reproducibility).
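
The idea can be sketched as follows, assuming the immutable fields are exactly the four named in the example list. The function names `immutable_hash` and `is_unchanged`, the SHA-256 digest, and the JSON-based serialization are illustrative assumptions, not part of amcat4.

```python
import hashlib
import json

# Assumption: the immutable fields are the ones listed as an example in this issue.
IMMUTABLE_FIELDS = ("date", "title", "url", "text")


def immutable_hash(doc: dict, fields=IMMUTABLE_FIELDS) -> str:
    """Hash only the immutable fields, in a fixed order (hypothetical helper)."""
    payload = [[name, doc.get(name)] for name in fields]
    canonical = json.dumps(payload, ensure_ascii=False, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_unchanged(doc_id: str, doc: dict) -> bool:
    """Check that a document still matches the ID it was stored under."""
    return doc_id == immutable_hash(doc)


if __name__ == "__main__":
    doc = {
        "date": "2023-06-11",
        "title": "Speech",
        "url": "https://example.org/speech",
        "text": "Dear colleagues",
    }
    doc_id = immutable_hash(doc)
    # Adding a mutable field (here a made-up "annotation") keeps the ID stable.
    print(is_unchanged(doc_id, {**doc, "annotation": "checked"}))
    # Editing an immutable field is detected because the hash no longer matches.
    print(is_unchanged(doc_id, {**doc, "text": "Edited text"}))
```

Because the ID depends only on the immutable fields, the same article gets the same ID on any server, and a mismatch between stored ID and recomputed hash signals that the content was altered.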
