
glados's Issues

Latest content validation causes deserialize failure due to sequence instead of string

Error seen when running local glados instance, due to validation addition in #99.

[2023-04-07T08:52:31Z WARN  glados_audit::validation] could not deserialize content bytes content.value="0x080000001b020000f9021..." err=Error("invalid type: sequence, expected a string", line: 0, column: 0)

At first glance, it seems to be rather an issue in what is expected as JSON data here: https://github.com/ethereum/glados/blob/master/glados-audit/src/validation.rs#L10.

I might be wrong as I don't really know this code base but I think the data coming from get_content (https://github.com/ethereum/glados/blob/master/glados-audit/src/lib.rs) is just the raw SSZ encoded bytes, and thus likely not to be accepted by the JSON parsing in the validation code?
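For reference, that exact error can be reproduced in isolation: serde_json emits it when a JSON sequence (e.g. content bytes represented as an array) is deserialized into a String. A minimal repro sketch (illustrative only, not glados code):

```rs
use serde_json::{json, Value};

// Deserializing a JSON *sequence* into a `String` produces exactly the error
// from the log: Error("invalid type: sequence, expected a string", line: 0, column: 0)
fn main() {
    let bytes_as_json: Value = json!([8, 0, 0, 0, 27, 2, 0, 0]);
    let result: Result<String, _> = serde_json::from_value(bytes_as_json);
    println!("{result:?}");
}
```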

Database is not initialized for in-memory db

Description

In glados-monitor, when an in-memory database is created using the "sqlite::memory:" argument, as in:

$ cargo run -p glados-monitor -- --database-url sqlite::memory: follow-head --provider-url http://127.0.0.1:8545

The following error is encountered:

Query(SqlxError(Database(SqliteError { code: 1, message: "no such table: content_key" })))

path: glados/entity/src/contentkey.rs:46:10

This indicates that upon lookup of the first key in the table, the action fails as the result of a non-existent table. 
```rs
// glados/entity/src/contentkey.rs
pub async fn get_or_create(content_key_raw: &impl ContentKey, conn: &DatabaseConnection) -> Model {
    // First try to look up an existing entry.
    let content_key = Entity::find()
        .filter(Column::ContentKey.eq(content_key_raw.encode()))
        .one(conn)
        .await
        .unwrap(); // <-- panics here with "no such table: content_key"
    // snip
}
```

Resolution

Some options I see:

  1. Require that the CLI flags prevent in-memory-db + no-migration combination
  2. Run db migration if in-memory-db is selected (this initialises the db)
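A rough sketch of option 2, assuming glados uses SeaORM's migration crate (the `migration::Migrator` path and the exact CLI wiring are assumptions):

```rs
use sea_orm::{Database, DatabaseConnection, DbErr};
use sea_orm_migration::MigratorTrait;

// Hypothetical helper: if the in-memory sqlite backend is selected, run the
// migrations up-front so tables like `content_key` exist before first use.
async fn connect(database_url: &str) -> Result<DatabaseConnection, DbErr> {
    let conn = Database::connect(database_url).await?;
    if database_url.contains("sqlite::memory:") {
        // `migration::Migrator` stands in for the project's migration entry point.
        migration::Migrator::up(&conn, None).await?;
    }
    Ok(conn)
}
```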

Improve web view of audit information.

Some notes on the information we want to be able to glean from glados.

  • in the last 1-hour/24-hours/7-days
    • total number of new content items
    • total number of audits performed with success/fail percentages
  • the N most recent:
    • successful audits
    • failed audits
    • new content items

Need links to view content:

  • List views of content that show:
    • content key / content-id / content-type / first-available-at / created-at / most-recent-audit-result
  • Ability to "filter/sort" content
    • filter by kind/type (multi-select)
    • filter by datetime range
    • sort by first-available-at
    • sort by created-at
    • filter by most-recent-audit-result (success/failure)

Roadmap notes

End goal is roughly:

  • configure application with any number of JSON-RPC connections to running nodes.
  • web application serves data from database
  • long running process collects information from the various running clients.

Starting point

  • single running client
  1. basic node routing table information
  2. network explorer enumerating ENR records and node ids
  3. explore data radius values

Add new timestamp field to content to represent first moment the data should have been available.

We currently have a created_at field on the ContentKey model, which is being set to the timestamp when the database entry was created....

I'd like to add another timestamp field first_available_at that is set to the time when the content should have first been available.

  • For headers/bodies/receipts this should be set to the timestamp of the corresponding block.
  • For epoch accumulators, this should be set to the timestamp of the last block in the accumulator.

The glados-monitor will need to be updated to correctly populate this field for all new entries.

The import_pre_merge_accumulators script in glados-monitor/src/lib.rs will need to be updated to populate this field.

We also need a script that can be used to populate existing records in the database for which this value is missing. This script should probably take something like an Infura URI or some other way of getting at a running JSON-RPC API to be able to fetch block data.

  • Step 1: Add new timestamp field as nullable (a rough migration sketch follows this list).
  • Step 2: run script to populate historical records
  • Step 3: modify database field to be non-nullable
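A rough sketch of the step-1 migration with SeaORM (the table/column identifiers here are assumptions, not the actual glados schema names):

```rs
use sea_orm_migration::prelude::*;

// Hypothetical identifiers; the real glados table/column names may differ.
#[derive(Iden)]
enum ContentKey {
    Table,
    FirstAvailableAt,
}

// Body of the migration's up() for step 1: add the column as nullable so
// existing rows remain valid until the backfill script (step 2) has run.
async fn add_first_available_at(manager: &SchemaManager<'_>) -> Result<(), DbErr> {
    manager
        .alter_table(
            Table::alter()
                .table(ContentKey::Table)
                .add_column(
                    ColumnDef::new(ContentKey::FirstAvailableAt)
                        .timestamp_with_time_zone()
                        .null(),
                )
                .to_owned(),
        )
        .await
}
```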

I believe this will also depend on us having a script that will backfill missing content-key metadata, since #45 is now merged and deployed, but the main instance of glados is missing most of the metadata for values that were already present in the database. Depends on https://github.com/pipermerriam/glados/issues/65

Content Dashboard loads slowly

http://glados.ethportal.net/content/ loads slowly.

The endpoint for this page has 8 queries. The times they currently take (with 3.9 million audit rows) are as follows:

Content ID: 120 ms
Content: 1641 ms
Audits: 451 ms
Successes: 303 ms
Failures: 377 ms
Hour stats: 941 ms
Day stats: 938 ms
Week stats: 958 ms
= 5729 ms

The two priorities should be improving

  • "Content" which is the most recent content that has been audited
  • Stats; note that these timings already include #134, so despite the speed-up it's still roughly a second per stat period.

A way to view audit metrics over time

Glados is good at showing us how well the network is doing right now.

I think the main index/dashboard should be updated so that it provides context for how well we are doing relative to how well we were doing previously. A simple view of this might be a digest for the last month, with each day segmented off and showing aggregate success/fail metrics.

  • This week 70% success
  • Last week 60% success
  • The week before that 62% success

The current "last hour", "last day" and "last week" metrics are still good, but I think they become less interesting/important than looking at how well we've been doing over longer time arcs.

How to handle Portal node timeouts

Description

Should a timeout be entered into the database as a failed audit or logged as an error?

glados-audit can receive either:

  • A Portal node can respond with "no content" ("0x") message.
  • A timeout. E.g., I have observed that an HTTP connection usually dies at 60s with Trin. Sometimes a message comes through (with the content or with the no-content message) at ~50 seconds. So it seems plausible that for some content a timeout could occur despite the content existing but being inaccessible within the timeout.

Current behaviour: Timeout is logged as an error in glados-audit

I am inclined to keep this behaviour.

Write tests to make sure glados is validating content correctly

Currently we have a high success rate on our testnet, but glados has a bug where it reports the audit as failed even though we can see in the trace that it did find the content.

[Screenshot: audit trace; the green bubble shows that the content was found.]

The reason we want this test is so that if we update library versions we know the code will still pass. We want to catch such bugs at the PR stage to prevent headaches and to ensure that a failure is an issue with the network, not with glados itself.

New block_number feature clash with postgres

Description

When main branch is run with a (new/empty) Postgres backend the following error occurs:

Query Error: error occurred while decoding column "block_number": mismatched types; 
Rust type `core::option::Option<i64>` (as SQL type `INT8`) is not compatible with SQL type `INT4`

Origin

This was introduced by PR #45, which was only tested using sqlite. Root cause TBD.

Full logged error

Failed to create database record 
content.key="0x004be441720f239cdc201bcafc41a29d86f5f4056005d0af29176ce0e19ade2c33" 
content.kind="header_metadata" 
err=Query Error: error occurred while decoding column "block_number": mismatched types; 
Rust type `core::option::Option<i64>` (as SQL type `INT8`) is not compatible with SQL type `INT4`

Solution

  • Possibly downgrade block_number to be 32-bit rather than 64-bit.
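One non-authoritative sketch of the opposite direction, declaring the column as BIGINT (INT8) so Postgres matches the entity's Option<i64>; the alternative is to keep the INT4 column and change the entity field to Option<i32>. Identifiers are assumptions:

```rs
use sea_orm_migration::prelude::*;

// Hypothetical identifiers; the real glados names may differ.
#[derive(Iden)]
enum ExecutionMetadata {
    Table,
    BlockNumber,
}

// Widen block_number to BIGINT so it matches the Rust Option<i64>.
async fn widen_block_number(manager: &SchemaManager<'_>) -> Result<(), DbErr> {
    manager
        .alter_table(
            Table::alter()
                .table(ExecutionMetadata::Table)
                .modify_column(
                    ColumnDef::new(ExecutionMetadata::BlockNumber)
                        .big_integer()
                        .not_null(),
                )
                .to_owned(),
        )
        .await
}
```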

Add a mode to `glados-monitor` for backfilling missing chain information.

depends on: #35

We need a mode for glados-monitor that looks at the database and finds missing historical blocks and backfills them.

My initial thoughts on how we model this in the CLI would be:

# normal mode that follows the head of the chain (this can be the default)
glados-monitor --mode follow-head

# backfill mode that fills until it reaches the head of the chain and then either exits, or sleeps and then restarts its search for missing information
glados-monitor --mode backfill
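A rough clap sketch of the proposed CLI surface (whether this ends up as a flag or a subcommand, and the exact names, are open; this is not the existing glados-monitor argument parsing):

```rs
use clap::{Parser, ValueEnum};

#[derive(Clone, Debug, ValueEnum)]
enum Mode {
    /// Follow the head of the chain (default).
    FollowHead,
    /// Scan the database for gaps, backfill missing blocks, then exit or sleep and repeat.
    Backfill,
}

#[derive(Parser)]
struct Cli {
    #[arg(long, value_enum, default_value = "follow-head")]
    mode: Mode,
}

fn main() {
    let cli = Cli::parse();
    println!("running in {:?} mode", cli.mode);
}
```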

Number of audit tasks generated-per-min is not configurable

Description

glados-audit can be viewed as a funnel as follows:

```mermaid
flowchart TD
    subgraph generate[Trigger every AUDIT_SELECTION_PERIOD_SECONDS]
        s1[Strategy Latest]
        s2[Strategy Random]
        s3[Strategy Random]
    end
    s1 & s2 & s3 --> |send KEYS_PER_PERIOD tasks|chan[Audit task channel]

    chan --> |take 1 task|a1 & a2 & a3 & a4
    subgraph fulfill[Continuously replenish with new threads once they complete]
        a1[Auditing thread 1]
        a2[Auditing thread 2]
        a3[...]
        a4[Auditing thread CONCURRENCY]
    end
    a1 & a2 & a3 & a4 --> node[Portal node]
```

At present, the CLI can control the throughput as follows:

  • --concurrency <n> flag controls the maximum funnel output rate.
  • --strategy <strat> flag controls the nature of the tasks generated (this has a limited effect on throughput, e.g., when passing --strategy random multiple times)

The two variables that control the maximum funnel input rate are:

  • KEYS_PER_PERIOD. Currently hard coded as 10.
  • AUDIT_SELECTION_PERIOD_SECONDS. Currently hard coded as 120 (seconds)

Thus max audits per minute can be calculated:

  • With one active strategy, the funnel is filled at 10/120 * 60 = 5 tasks (individual content key audits) per minute.
  • With the current default of three active strategies, the funnel is filled at 3 * 10/120 * 60 = 15 tasks (individual content key audits) per minute.

Note that the observed audits/min rate will be lower, because audits that time out against a portal node are not recorded as pass/fail.

The funnel has a "rim height" set to overflow at 100 pending tasks. That is, once the channel holds 100 pending tasks, newly generated tasks are discarded.

Resolution

Expose the funnel input controls from the CLI. Options (a rough flag sketch follows this list):

  1. Expose KEYS_PER_PERIOD variable.
  2. Expose AUDIT_SELECTION_PERIOD_SECONDS variable.
  3. Expose KEYS_PER_PERIOD and AUDIT_SELECTION_PERIOD_SECONDS variables.
  4. New --max-task-rate <n = max audits per min> flag that controls maximum audits per minute that are generated.
    a. Titrate AUDIT_SELECTION_PERIOD_SECONDS to n, taking into account number of strategies and KEYS_PER_PERIOD
    b. Titrate KEYS_PER_PERIOD to n, taking into account number of strategies and AUDIT_SELECTION_PERIOD_SECONDS.
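For illustration, options 1-3 could look roughly like the following (flag names are suggestions, not existing glados-audit flags):

```rs
use clap::Parser;

#[derive(Parser)]
struct AuditCli {
    /// Tasks generated per strategy per selection period (currently hard coded as 10).
    #[arg(long, default_value_t = 10)]
    keys_per_period: u64,

    /// Seconds between task-generation rounds (currently hard coded as 120).
    #[arg(long, default_value_t = 120)]
    audit_selection_period_seconds: u64,
}

fn main() {
    let cli = AuditCli::parse();
    // Max generated audits per minute = strategies * keys_per_period / period * 60.
    let strategies = 3.0;
    let per_min = strategies * cli.keys_per_period as f64
        / cli.audit_selection_period_seconds as f64
        * 60.0;
    println!("max generated audits per minute: {per_min}");
}
```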

Current flags

Usage: glados-audit [OPTIONS] --transport <TRANSPORT>

Options:
  -d, --database-url <DATABASE_URL>
          [default: sqlite::memory:]

  -i, --ipc-path <IPC_PATH>
          

  -u, --http-url <HTTP_URL>
          

  -t, --transport <TRANSPORT>
          [possible values: ipc, http]

  -c, --concurrency <CONCURRENCY>
          number of auditing threads
          
          [default: 4]

  -s, --strategy <STRATEGY>
          Specific strategy to use. Default is to use all available strategies. May be passed multiple times for multiple strategies (--strategy latest --strategy random). Duplicates are permitted (--strategy random --strategy random).

          Possible values:
          - latest:
            Content that is: 1. Not yet audited 2. Sorted by date entered into glados database (newest first)
          - random:
            Randomly selected content
          - failed:
            Content that looks for failed audits and checks whether the data is still missing. 1. Key was audited previously 2. Latest audit for the key failed (data absent) 3. Keys sorted by date audited (keys with oldest failed audit first)
          - select_oldest_unaudited:
            Content that is: 1. Not yet audited. 2. Sorted by date entered into glados database (oldest first)

  -h, --help
          Print help information (use `-h` for a summary)

  -V, --version
          Print version information

Local validation of content

Currently glados is happy with a non-zero response when it audits content.

This should be changed to actually check that the content returned from the network correctly passes validation.

  • For headers, we should re-construct the RLP header and verify it hashes to the block hash (a minimal sketch follows this list).
  • For bodies we should reconstruct the transaction and uncle tries and verify them against the corresponding header fields
  • For receipts we should reconstruct the trie and verify it against the corresponding header field
  • For the accumulator, we should verify that the epoch hash matches the one from the master accumulator
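For the header check in the first bullet, a minimal sketch, assuming the raw RLP-encoded header bytes and the expected block hash have already been extracted from the content key/value (the extraction itself is out of scope here):

```rs
use sha3::{Digest, Keccak256};

// The header is valid only if keccak256(rlp(header)) equals the block hash.
fn header_hash_matches(rlp_encoded_header: &[u8], expected_block_hash: &[u8; 32]) -> bool {
    let computed = Keccak256::digest(rlp_encoded_header);
    computed.as_slice() == expected_block_hash.as_slice()
}
```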

Backfill metadata tables

What is wrong

With #45 merged and deployed, we now have lots of ContentKey/Id records in the production glados instance that don't have the associated metadata populated.

How can it be fixed

Write a script that can be run against an existing database which will find ContentKey rows that don't have associated meta-data, and that backfills this data.

Failure of BlockHeader validation since Shanghai

Running glados since Shanghai / Capella fork results in failure of header validation.

I did not investigate this further but presumably this is because of the added withdrawals_root field in the BlockHeader (see EIP-4895)

Block bodies and receipts still work. Bodies did change also (added withdrawals), but we didn't actually alter those yet in the Portal specifications/implementations. BlockHeader requires the added field immediately, however, because otherwise the computed block hash is no longer valid.

Example error:

[2023-04-13T13:07:32Z WARN  glados_audit::validation] computed header hash did not match expected content.key="0x00327395c9900fbd349f338069b0ecc98a547e52ad0e9f430ff13a5fe314669176" content.value="0x080000003d020000f90232a032a4a6a8ebb57c8fb34ecb81e4d331bef22f80bc87bc98d650ac7ea982bc8522a01dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d4934794388c818ca8b9251b393131c08a736a67ccb19297a049cba5a8779350f39813bb849fa94e1020288e79d9de694740462d26f55eee2fa022ec8930142a33421dba2ddfd4284bcb4d7dcf40c5718ce4d800d18101464bf9a0759d23c1d8bcaacda19cb90a64d29558f1a36cfe447a44553f5b4fe4e313039ab90100bcb9528483419a03966a0d00a65567ab2a31a899d4913a2ac881330464012039923413dc2514902124a03f18340abd5a0b83bc1a8b8a3cc972960640307853012686c6026a1a986ba9b64819d5386ef69c81088691411ec992e35b659c00a259ae60a01612222457252f52a0124aa9690253687d5454af0aa60710da8808fd086146c0d0131495ec08f8cf75d39a20b703a22d818344db1875cfbada4538703bab5a61d27965e24714e177c4591cadef7d61c2c002c5817a13af4472b0bc00d8e5c803073343ba721d12c50b30dc7209d6ee9902e444061ce4a112269a17383010d8e63b8852400d16ccb389a4e20405e877109f10c8507422cc91f8e45646a480840103fd7b8401c9c380839c7a1f846437fe378f6265617665726275696c642e6f7267a0726e88cc67d9fcf2c4fd23b50158911fc1633a991f131aa6dc4f8c43d4a288ef880000000000000000850b097be493a06f6a7789c8b169158508f74372ca3bae7de88a4ad7a802d1a1b49030db3b250d00"

OverlayContentKey use does not include selector

Description

A T: OverlayContentKey passed into a function does not have a way to get the actual content_key (e.g., the bytes including the selector).

We currently use .into(), which is the counterpart of From and has the following implementation:

// trin/ethportal-api/src/types/content_key.rs
impl From<HistoryContentKey> for Vec<u8> {
    fn from(val: HistoryContentKey) -> Self {
        val.as_ssz_bytes()
    }
}

This gets the bytes of the enum using the derived Encode implementation:

/// A content key in the history overlay network.
#[derive(Clone, Debug, Decode, Encode, Eq, PartialEq)]
#[ssz(enum_behaviour = "union")]
pub enum HistoryContentKey {

Which does not include the selector.

Add support for distance queries against content

Now that #114 is merged, we need to do the same for content.

  • add a new content_id_high: i64 field to the content table (one way to derive it is sketched after this list).
  • add ability to query for content closest to a specific node-id
  • add ability to query for nodes closest to a specific content-id
  • add ability to query for content closest to other content.
  • add all of these to the web views in some way.
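One possible way to derive the proposed content_id_high value; this is only a sketch, and the approach actually taken for node ids in #114 may differ:

```rs
// Take the first 8 bytes of the 32-byte content id and map them to an i64.
// Flipping the sign bit preserves unsigned byte ordering under signed i64
// ordering, so BIGINT sorting in the database matches content-id prefix order.
fn content_id_high(content_id: &[u8; 32]) -> i64 {
    let mut prefix = [0u8; 8];
    prefix.copy_from_slice(&content_id[..8]);
    (u64::from_be_bytes(prefix) ^ (1 << 63)) as i64
}

fn main() {
    let id = [0xffu8; 32];
    println!("{}", content_id_high(&id)); // the largest prefix sorts last
}
```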

Unify logging formatting

Go through all the logging and make sure that everything is being logged in ergonomic ways.

  • binary data should be hex encoded.
  • structured logging should be using the same key/value pairs
  • lists are better with 3 things.

Absent 0x prefix in portal network request

Description

In a call to "portal_historyRecursiveFindContent", the parameters sent do not include the "0x" prefix. This results in an error:

Error while processing portal_historyRecursiveFindContent: Error returned from chain history subnetwork: \"Invalid RecursiveFindContent params: \\\"Unable to decode content_key\\\"\"

Occurs in: glados_core::jsonrpc::PortalClient::get_content().

Solution

  • Likely fixed/obsoleted by upcoming json-rpc changes in trin
  • Implies that an OverlayContentKey type should have an "as_hex_string()" convenience method to prevent similar issues (see the sketch below).
  • May be quick-fixed by adding "0x" to the string.
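A sketch of the quick fix / hypothetical convenience method (the function name and the exact encoding call are assumptions):

```rs
// Hex-encode a content key's SSZ bytes with a "0x" prefix before sending it
// as a JSON-RPC parameter. `key_bytes` stands in for the OverlayContentKey encoding.
fn as_hex_string(key_bytes: &[u8]) -> String {
    format!("0x{}", hex::encode(key_bytes))
}

fn main() {
    assert_eq!(as_hex_string(&[0x00, 0xab]), "0x00ab");
}
```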

Gracefully handle failed lookups in glados-web

What is wrong

Trying to view a content key/id in the web interface for something that isn't present in the database results in a panic.

How can it be fixed

Provide a reasonable 404 not found page instead of exploding
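A minimal sketch, assuming glados-web is an actix-web application (the handler shape is illustrative, not the existing code): instead of unwrapping the database lookup, return a 404 response when the row is absent.

```rs
use actix_web::HttpResponse;

// Hypothetical helper: turn an optional lookup result into a response.
fn content_key_or_404(lookup: Option<String>) -> HttpResponse {
    match lookup {
        Some(body) => HttpResponse::Ok().body(body),
        None => HttpResponse::NotFound().body("Content key not found"),
    }
}
```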

Add meta/context information to the content-key/content-id information in the database.

We need to add some new information to our database that lets us have a richer view into what each content-key/content-id represents.

The new tables should loosely be something like:

execution_header
---
block_number: uint64 (unique)
block_hash: [u8, 32] (unique)
content_id: foreign_key_to(content_id) (unique)

execution_body
---
block_number: uint64 (unique)
block_hash: [u8, 32] (unique)
content_id: foreign_key_to(content_id) (unique)

execution_receipts
---
block_number: uint64 (unique)
block_hash: [u8, 32] (unique)
content_id: foreign_key_to(content_id) (unique)

The glados-monitor process needs to be updated to populate these tables with information.
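As a rough illustration, one of the proposed tables expressed as a SeaORM entity (names and types are assumptions based on the schema above, not existing glados code):

```rs
use sea_orm::entity::prelude::*;

#[derive(Clone, Debug, PartialEq, DeriveEntityModel)]
#[sea_orm(table_name = "execution_header")]
pub struct Model {
    #[sea_orm(primary_key)]
    pub id: i32,
    #[sea_orm(unique)]
    pub block_number: i64,
    #[sea_orm(unique)]
    pub block_hash: Vec<u8>,
    /// Foreign key to the content_id table.
    #[sea_orm(unique)]
    pub content_id: i32,
}

#[derive(Copy, Clone, Debug, EnumIter, DeriveRelation)]
pub enum Relation {}

impl ActiveModelBehavior for ActiveModel {}
```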

Audit & Display Beacon Data

Tasks

No tasks being tracked yet.

Cannot create Non-Unique indices using SeaORM

There are a few SQL indices that we want for performance purposes that we do not want to have UNIQUE constraints:

(content_audit::CreatedAt, content_audit::Result)
(content::FirstAvailableAt, content::ProtocolId)
key_value::Key
execution_metadata::BlockNumber

Creating a non-unique index in SeaORM via e.g.:

.index(
    Index::create()
        .name("idx_execution-block_number")
        .col(ExecutionMetadata::BlockNumber),
)

results in the error:

thread 'main' panicked at 'Database migration failed: Exec(SqlxError(Database(PgDatabaseError { severity: Error, code: "42601", message: "syntax error at or near \"(\"", detail: None, hint: None, position: Some(Original(172)), where: None, schema: None, table: None, column: None, data_type: None, constraint: None, file: Some("scan.l"), line: Some(1188), routine: Some("scanner_yyerror") })))', glados-monitor/src/main.rs:47:14

The offending SQL appears to be:

CREATE TABLE "execution_metadata" (
      "id" serial NOT NULL PRIMARY KEY,
      "content" integer NOT NULL,
      "block_number" integer NOT NULL,
      CONSTRAINT "idx_execution-block_number" ("block_number"),
      CONSTRAINT "idx-unique-metadata" UNIQUE ("content"),
      CONSTRAINT "FK_executionmetadata_content" FOREIGN KEY ("content") REFERENCES "content" ("id") ON DELETE
      SET
        NULL ON
      UPDATE
        CASCADE
    )

while the working SQL generated with a unique constraint looks like:

CREATE TABLE "execution_metadata" (
      "id" serial NOT NULL PRIMARY KEY,
      "content" integer NOT NULL,
      "block_number" integer NOT NULL,
      CONSTRAINT "idx_execution-block_number" UNIQUE ("block_number"),
      CONSTRAINT "idx-unique-metadata" UNIQUE ("content"),
      CONSTRAINT "FK_executionmetadata_content" FOREIGN KEY ("content") REFERENCES "content" ("id") ON DELETE
      SET
        NULL ON
      UPDATE
        CASCADE
    )

The only difference is the UNIQUE on the fifth line, so how that's resulting in a "syntax error at or near \"(\"" is unclear.
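For what it's worth, Postgres table constraints can only be CHECK / UNIQUE / PRIMARY KEY / FOREIGN KEY / EXCLUDE, so a bare CONSTRAINT "name" ("column") clause is simply invalid syntax; a non-unique index has to be created with a separate CREATE INDEX statement. A sketch of that workaround in the migration (identifiers assumed):

```rs
use sea_orm_migration::prelude::*;

#[derive(Iden)]
enum ExecutionMetadata {
    Table,
    BlockNumber,
}

// Create the non-unique index as its own statement, after the table exists,
// instead of attaching it to Table::create().
async fn create_block_number_index(manager: &SchemaManager<'_>) -> Result<(), DbErr> {
    manager
        .create_index(
            Index::create()
                .name("idx_execution-block_number")
                .table(ExecutionMetadata::Table)
                .col(ExecutionMetadata::BlockNumber)
                .to_owned(),
        )
        .await
}
```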

Support for running multiple clients for glados-audit

With #85 added, it seems we will now want the ability to run glados-audit with multiple clients.

  • Change the audit schema to allow storing information about the client that was used for the audit. I suggest we store both a reference to the ENR record and the client version string.
  • Change to allow "multiple" clients (requiring at least 1) to be specified via --ipc-path or --http-url
  • Change the audit process to round-robin audits across all of the available clients

Add ability to run audits with different selection strategy

(low priority, only useful once the network is actually mostly working and most audits are successful)

depends on: https://github.com/pipermerriam/glados/issues/37

Let's call the strategy laid out in #37 the latest strategy, as it focuses on auditing the newest stuff.

We want two more strategies, and to modify glados-audit in any way necessary to allow us to run multiple audit processes concurrently (which might require database locking of some sort).

  • A random strategy, that simply randomly audits things.
  • A missing strategy that looks for failed audits and checks whether the data is still missing.

Change audit strategy

The current selection strategy for content auditing needs to be updated. The priority for auditing should be as follows.

  • Query content_id table and order by number of audits that have been performed (fewer first, more last)
  • Secondary ordering by creation date content_id.created_at in descending order (newer first, older last)

This ensures that we first focus on auditing the latest content that has never been audited; once everything in the database has been audited, the focus shifts to re-auditing, with priority on the newest of those items.

Timestamp problem

Dec 14 18:44:07 localhost run_glados_monitor.sh[699373]: thread 'tokio-runtime-worker' panicked at 'Error inserting new content key: Query(SqlxError(ColumnDecode { index: "\"created_at\"", source: "mismatched types; Rust type `core::option::Option<chrono::datetime::DateTime<chrono::offset::utc::Utc>>` (as SQL type `TIMESTAMPTZ`) is not compatible with SQL type `TIMESTAMP`" }))', /root/glados/entity/src/contentkey.rs:64:14

Just another issue that showed up when switching from sqlite3 to Postgres.
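A sketch of one way to line the types up, declaring the column as TIMESTAMPTZ so it matches the entity's DateTime<Utc> (identifiers assumed; alternatively the entity field could be switched to a timezone-less chrono::NaiveDateTime):

```rs
use sea_orm_migration::prelude::*;

// Hypothetical identifiers; the real glados names may differ.
#[derive(Iden)]
enum ContentKey {
    Table,
    CreatedAt,
}

// Change created_at from TIMESTAMP to TIMESTAMPTZ.
async fn use_timestamptz(manager: &SchemaManager<'_>) -> Result<(), DbErr> {
    manager
        .alter_table(
            Table::alter()
                .table(ContentKey::Table)
                .modify_column(
                    ColumnDef::new(ContentKey::CreatedAt)
                        .timestamp_with_time_zone()
                        .not_null(),
                )
                .to_owned(),
        )
        .await
}
```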

Audits do not record the originating strategy

Description

One suggestion for a future PR is that we include within the audit's DB record which selection strategy it was triggered by.

When an audit fails, it may be exclusively from a particular strategy. By incorporating the source strategy, such correlations could be investigated.

Accept `null` as "no content" response

Description

Glados treats a null portal_recursiveFindContentResult response from a Portal Network node as an error. It should treat these the same as "0x" (interpreted as "content not found").

I had followed some discussion in the following PR (ethereum/portal-network-specs#176), but now I see that this is specifically for portal_*LocalContent, which uses a special "0x0" response.

Looking at the spec

  "RecursiveFindContentResult": {
    "name": "recursiveFindContentResult",
    "description": "The data corresponding to the lookup target",
    "schema": {
      "title": "Encoded target content data",
      "$ref": "#/components/schemas/hexString"
    }
  },

Where hexString is:

  "hexString": {
    "title": "Hex string",
    "type": "string",
    "pattern": "^0x[0-9a-f]$"
  }

So null and "0x" both seem valid, and Glados should treat either as meaning "content is absent", rather than the current behaviour of looking only for "0x".

Relevant code:

https://github.com/ethereum/glados/blob/master/glados-core/src/jsonrpc.rs#L166-L186
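A sketch of the proposed handling (illustrative only, not the actual parsing code in jsonrpc.rs): treat both JSON null and "0x" as "content is absent", and anything else as hex-encoded content bytes.

```rs
use serde_json::Value;

enum ContentResponse {
    Absent,
    Present(Vec<u8>),
}

fn parse_find_content_result(result: &Value) -> Result<ContentResponse, String> {
    match result {
        Value::Null => Ok(ContentResponse::Absent),
        Value::String(s) if s == "0x" => Ok(ContentResponse::Absent),
        Value::String(s) => hex::decode(s.trim_start_matches("0x"))
            .map(ContentResponse::Present)
            .map_err(|e| format!("invalid hex in response: {e}")),
        other => Err(format!("unexpected response type: {other}")),
    }
}
```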

Glados audit incompatibility with Fluffy - fix out of date portal JSON-RPC API?

Latest version of glados is no longer compatible with Fluffy.

Each audit returns a failure even though there is clearly data arriving:

[2023-09-20T12:37:11Z DEBUG hyper::proto::h1::conn] incoming body is content-length (73007 bytes)
[2023-09-20T12:37:11Z DEBUG hyper::proto::h1::conn] incoming body completed
[2023-09-20T12:37:11Z ERROR glados_audit] Problem requesting content from Portal node. content.key="0x01e57dc6f3241f3a4f8c55293a3ec3afe21563f3614c0cfa049facc7314ee2460b" err=ContainsNone

From some further debugging, it looks like the JSON response parsing (as_str) is failing: https://github.com/ethereum/glados/blob/master/glados-core/src/jsonrpc.rs#L200

There has been a related change to this JSON-RPC call: ethereum/portal-network-specs@92b79b8

Retesting this with a modified Fluffy build that returns just the content (as before the spec change), not the object, works.

Could it be that the Portal JSON-RPC API usage here needs an update?

In Fluffy we implemented this change because otherwise portal-hive was failing, so I'm not sure how this works for Trin; perhaps it's due to the usage of portal_historyTraceRecursiveFindContent in glados?

Web view for DHT census data

We should have a web view to explore census data.

  • Paginated list of all past census data showing (Id/CreatedAt/NumNodes)
  • Detail page for a census that shows meta stats (CreatedAt/NumNodes), client diversity graph, and a list of all found ENR records

Make auditing compatible with different sub-protocols

Description

The auditing is specific to the History network. If it were made agnostic to sub-protocols, this would save work later on.
The cause of the current limitation is that we:

  1. ✅ (glados-monitor) Follow a Portal node and record content keys/ids/metadata in a glados-db
  2. ✅ (glados-audit) Employ different strategies to decide on what glados-db content to audit.
  3. (glados-audit) presume all content in glados-db is from the History sub-protocol and send HistoryContentKeys in the mpsc audit channel. ❌ Would not handle other sub-protocols.
  4. (glados-audit) For each task in the audit channel, perform a portal_historyRecursiveFindContent request to a Portal Node. ❌ Would not handle other sub-protocols.

Solution

In three parts:

Record the sub-protocol in the glados-db.

Currently we do this with the execution metadata tables, for header, body & receipts but not EpochAccumulator. Options are:
A. Treat these tables as sub-protocol identifiers. Would need to add a table for EpochAccumulator items, and any items added for different sub-protocols
B. Add a new table for sub-protocol

Option B seems cleaner because in the event of a second sub-protocol, you do not need to check multiple tables to find matches.

Make mpsc channel more broad

The channel currently handles HistoryContentKey. It could instead send database IDs through the channel.

Lookup sub-protocols at audit time

The audit task can operate as follows:

    1. Look up what sub-protocol the item is from
    2. Call the appropriate portal_*RecursiveFindContent
    3. Log the appropriate metadata (based on the sub-protocol)

Considerations

Any key entered into the glados-db may clash with an identical key on a different sub-protocol. To handle this,
keys should AFAICT have a many-to-one relationship with a sub-protocol identifier.

  • Decide on a key to audit
    • Look up what sub-protocol it is from
    • Audit on that sub-protocol (E.g., portal_historyRecursiveFindContent)
    • Record audit result for that content
  • Glados monitor creates a second key that is identical (to the above) but on a different subprotocol
    • Need to store the key, but in a way so as to not inherit the audit from the similar key
      • Perhaps by storing the sub-protocol along with the audit date/result data.

Not yet sure of the best way to organise a sub-protocol table / foreign key to achieve this.

Alternatives

Maybe there are other better solutions.

  • ? Separate database for sub-protocols

Network explorer

Need to write the network explorer pieces.

  • process that regularly (every N minutes) "walks" the network using RFN (recursive find nodes) to enumerate all knowable ENR records.
  • a process that looks up ENR records from the database and uses PING messages to determine "liveliness". At present, we will fail on any node behind a NAT since we don't have traversal.

Some web views that allow exploration of this data.

Add block metadata to web view

The web application should be updated to include the new block metadata in any contentkey/id displays so that when we view a content key/id in the web application we can know if it is a header/body/receipts/accumulator

Add concurrency to auditing

What is wrong

Currently, I believe our auditing is effectively single threaded, i.e., we only ever hit our connected portal node (trin) with a single running content lookup. Since content lookups can take a bit of time due to needing to traverse the network, we should probably run multiple of these concurrently.

How can it be fixed

Change glados-audit to be able to run multiple lookups concurrently.

Add a new flag to glados-audit such as --concurrency N which sets the number of concurrent lookups that can be running at any given time.
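A rough sketch of bounding concurrent lookups with a tokio semaphore sized by --concurrency; audit_content() stands in for the existing single-lookup audit logic and is not a real glados function:

```rs
use std::sync::Arc;
use tokio::sync::Semaphore;

async fn run_audits(tasks: Vec<u64>, concurrency: usize) {
    let semaphore = Arc::new(Semaphore::new(concurrency));
    let mut handles = Vec::new();
    for task in tasks {
        // Wait for a free slot before spawning the next lookup.
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            audit_content(task).await;
            drop(permit); // release the slot so the next lookup can start
        }));
    }
    for handle in handles {
        let _ = handle.await;
    }
}

async fn audit_content(_task: u64) {
    // placeholder for the real network lookup
}
```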

Reduce channel buffer size for `glados-audit`

What is wrong

Glados seems to lag roughly 4 minutes behind "now" for auditing "latest" content.

How can it be fixed

This is just a guess but...

let (collation_tx, collation_rx) = mpsc::channel::<AuditTask>(100);

and

let (tx, rx) = mpsc::channel::<AuditTask>(100);

With a channel size of 100 and glados running at roughly 50 audits per minute, tasks can sit in the main collation channel for as much as 2 minutes, after already waiting at least as long in the individual strategy channel before being picked up. This is too long!

Let's try something dumb/simple, like changing both of these numbers to 4.

Monitor may miss blocks

Description

The way errors are handled in glados-monitor, within follow_chain_head() and retrieve_new_blocks(), leads to situations where a new block is not stored in the database.

For example:

  • When transmitting a new block number to the retrieval thread, the message may fail to send.
  • If during retrieval the block contents are not received properly, there is no attempt to try again later.

Discussion

I suspect that this is a non-issue, because glados performs a sampling based audit, rather than a completeness audit.

That is, glados creates a record of keys to challenge the portal node with. It should pass all keys tested (not all canonical keys). The tested keys are a subset of all keys, so if glados fails to record every block at the chain head, the sampling will still be valid.

Actions

  1. If this is the right way to think about it, this issue can be closed.
  2. If glados should strive for completeness, I can take a look at making glados-monitor handle these error cases by remembering/retrying rather than moving on.

Table hierarchy

Description

Need to decide whether content_id (current) or content_key is the subject matter for operations in glados.

Highest ranked item:

  • Trin: content_key. In ethportal-api we define OverlayContentKey which has the method content_id. So content_key is the top level
  • Glados: content_id. In glados/entities we define a table contentid with a "has_many" content_keys relationship. So the content_id is the top level.

This seems to be inverted in glados.

Discussion

  • What are the downsides to staying with content_id?
    • Mental overhead: Function names are content_key-centric
    • Additional lookups: block data -> content_key table -> content_id table

problem with migrations

Execution Error: error returned from database: foreign key constraint "fk_enr_id_node_id" cannot be implemented

Add Radius and "Should store" column in Audit page

The audit page is very useful to track how content is available over the network and how many hops are done for a recursive find content.

Now when a No Progress node occurs, it would be nice to know whether that node should have had the content or not.

I think that to achieve this we could add a radius column that could optionally be filled in if the Origin node happens to know the radius of the No Progress (or other) nodes. Additionally, a calculation based on the distance and radius of this node can be done to see whether this node should store this content in theory (a sketch of this check follows below). This could be indicated in another column, or perhaps with a colour, e.g. marking the whole row red if the node didn't have the content but should have had it.
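A sketch of the "should store" check described above: a node should store the content if the XOR distance between its node id and the content id is within its advertised radius (all three are 256-bit values).

```rs
fn should_store(node_id: &[u8; 32], content_id: &[u8; 32], radius: &[u8; 32]) -> bool {
    let mut distance = [0u8; 32];
    for i in 0..32 {
        distance[i] = node_id[i] ^ content_id[i];
    }
    // Byte-wise big-endian comparison equals numeric comparison of the 256-bit values.
    distance <= *radius
}
```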

Additional summaries could be made based on how many nodes failed to fulfill their task, etc.

I assume that the portal_historyTraceRecursiveFindContent call would need a slight adaptation to include the optional radius information.

edit: Actually, the information is equally useful for Responded nodes.

Improve DHT census code routing table enumeration

Once #117 is merged there are improvements that can be made to the routing table enumeration.

Presently, we simply send a FIND_NODES request for all buckets between 245-256. We should be able to cut this number of requests down significantly.

A smarter tactic would be to change this range to be dynamic. Querying bucket 256 is rarely useful because everything in that bucket will almost always be in someone else's lower numbered bucket that is closer. Similarly, once we start hitting the lower bucket numbers and a request comes back empty, we should also be able to exit early.

A census algorithm that still reliably gets us 99.9% visibility into the nodes of the network and that requires 10x fewer network requests is an improvement.
