osm-search / nominatim-data-analyser
QA Tool for Nominatim. Helps to improve the OpenStreetMap data quality and therefore the Nominatim search results.
License: GNU General Public License v2.0
It would be nice to have timestamp prefixes for every log line, so it is easy to spot long-running steps. For what it is worth, Nominatim uses: https://github.com/osm-search/Nominatim/blob/925195725dfcb7f1a6795c50244c1df6cb7242ce/nominatim/cli.py#L79
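A minimal sketch of what timestamp prefixes could look like with Python's standard logging module. The format string and the logger name "analyser" are assumptions for illustration; the exact format Nominatim uses is in the linked cli.py and is not reproduced here.

```python
import logging

# Assumed layout: timestamp, level, message.
LOG_FORMAT = "%(asctime)s %(levelname)s: %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def make_timestamped_handler() -> logging.Handler:
    """Return a stream handler that prefixes every line with a timestamp."""
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT))
    return handler


def configure_logging(level: int = logging.INFO) -> logging.Logger:
    """Attach the timestamped handler once to a (hypothetical) 'analyser' logger."""
    logger = logging.getLogger("analyser")
    logger.setLevel(level)
    if not logger.handlers:
        logger.addHandler(make_timestamped_handler())
    return logger
```

With this in place, the gap between two consecutive timestamps shows how long a step took.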
When iterating over zoom levels up to the max zoom, the number of tiles increases exponentially. Calling index.getTile(z, x, y) for each tile is enough to make the execution time much longer as we reach higher zoom levels.
The generation could be made much faster by ignoring tiles for which we have already seen that they contain no features or clusters.
We should try to use a data structure with O(1) access time to store the ignored tiles.
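The idea can be sketched in Python with a set, which gives O(1) average lookup. This is only an illustration, not the clustering-vt code: get_tile is a hypothetical stand-in for index.getTile, it is assumed to return None for a tile without features or clusters, and it is assumed that an empty tile implies empty children (tile buffers are ignored).

```python
from typing import Callable, Iterator, Optional, Set, Tuple

Tile = Tuple[int, int, int]  # (z, x, y)


def iter_non_empty_tiles(
    get_tile: Callable[[int, int, int], Optional[object]],
    max_zoom: int,
) -> Iterator[Tuple[Tile, object]]:
    """Yield non-empty tiles, never calling get_tile below a known-empty tile."""
    empty_parents: Set[Tuple[int, int]] = set()  # empty (x, y) at the previous zoom
    for z in range(max_zoom + 1):
        empty_here: Set[Tuple[int, int]] = set()
        for x in range(2 ** z):
            for y in range(2 ** z):
                # O(1) membership test: the parent of (x, y) is (x // 2, y // 2).
                if (x // 2, y // 2) in empty_parents:
                    empty_here.add((x, y))
                    continue
                tile = get_tile(z, x, y)
                if tile is None:
                    empty_here.add((x, y))
                else:
                    yield (z, x, y), tile
        empty_parents = empty_here
```

This version still enumerates every coordinate and only saves the get_tile calls. Walking only the children of non-empty tiles would avoid the enumeration as well.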
https://www.openstreetmap.org/node/9122275031
is next to the Otto-Weidt-Platz:
https://www.openstreetmap.org/way/642581102
Hi,
To fix an error found by the tool, the user has to right-click the link > Copy Link, switch to the JOSM window, hit Ctrl+Shift+O and press Enter to load the object. Can you please add a link to open the object in JOSM next to the 'Node ID' link (shown in the pop-up bubble)?
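Such a link could be built on JOSM's remote control, which listens on http://127.0.0.1:8111 by default and offers a load_object command. A rough sketch of the URL construction follows; the function name is made up for illustration, and the user needs remote control enabled in the JOSM preferences.

```python
from urllib.parse import urlencode

JOSM_REMOTE = "http://127.0.0.1:8111"  # default JOSM remote-control address
_TYPE_PREFIX = {"node": "n", "way": "w", "relation": "r"}


def josm_load_object_url(osm_type: str, osm_id: int) -> str:
    """Build a remote-control URL that asks a running JOSM to load one object."""
    query = urlencode({"objects": f"{_TYPE_PREFIX[osm_type]}{osm_id}"})
    return f"{JOSM_REMOTE}/load_object?{query}"
```

For example, josm_load_object_url("node", 9122275031) gives a link that loads that node directly, without the copy-and-paste steps.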
When tippecanoe returns an error, an exception is thrown while the error is being printed:
tippecanoe: must specify -o out.mbtiles or -e directory
Traceback (most recent call last):
  File "/srv/qa-data.nominatim.org/Nominatim-Data-Analyser/analyser/core/pipes/output_formatters/vector_tile_formatter.py", line 42, in call_tippecanoe
    result = subprocess.run(
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['tippecanoe', '--output-to-directory=/srv/qa-data.nominatim.org/qa-data/addr_housenumber_no_digit/vector-tiles', '--force', '--no-tile-compression', '--no-tile-size-limit', '--no-feature-limit', '--buffer=120', '--no-clipping', '-r1', '--cluster-distance=60']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "cli.py", line 17, in <module>
    Core().execute_all()
  File "/srv/qa-data.nominatim.org/Nominatim-Data-Analyser/analyser/core/core.py", line 28, in execute_all
    self.execute_one(file_without_ext)
  File "/srv/qa-data.nominatim.org/Nominatim-Data-Analyser/analyser/core/core.py", line 36, in execute_one
    PipelineAssembler(loaded_yaml, name).assemble().process_and_next()
  File "/srv/qa-data.nominatim.org/Nominatim-Data-Analyser/analyser/core/pipe.py", line 38, in process_and_next
    result = pipe.process_and_next(result)
  File "/srv/qa-data.nominatim.org/Nominatim-Data-Analyser/analyser/core/pipe.py", line 38, in process_and_next
    result = pipe.process_and_next(result)
  File "/srv/qa-data.nominatim.org/Nominatim-Data-Analyser/analyser/core/pipe.py", line 38, in process_and_next
    result = pipe.process_and_next(result)
  File "/srv/qa-data.nominatim.org/Nominatim-Data-Analyser/analyser/core/pipe.py", line 36, in process_and_next
    result = self.process(data)
  File "/srv/qa-data.nominatim.org/Nominatim-Data-Analyser/analyser/core/pipes/output_formatters/vector_tile_formatter.py", line 28, in process
    self.call_tippecanoe(self.base_folder_path, feature_collection)
  File "/srv/qa-data.nominatim.org/Nominatim-Data-Analyser/analyser/core/pipes/output_formatters/vector_tile_formatter.py", line 61, in call_tippecanoe
    self.log(logging.FATAL, e)
  File "/srv/qa-data.nominatim.org/Nominatim-Data-Analyser/analyser/core/pipe.py", line 84, in log
    LOG.log(level, f'Rule <{self.exec_context.rule_name}> : {msg}')
  File "/usr/lib/python3.8/logging/__init__.py", line 1508, in log
    raise TypeError("level must be an integer")
TypeError: level must be an integer
Command exited with non-zero status 1
This looks like a bug in the log() function in pipe.py.
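Since logging.FATAL is itself an integer (an alias of logging.CRITICAL), one plausible cause is that log() takes its arguments in a different order than the call self.log(logging.FATAL, e) assumes, so the exception object ends up as the level. This is a guess, because pipe.py is not shown here. A sketch of a helper that rules out this class of mix-up by making the level keyword-only follows; the function shape and the logger name are hypothetical, not the project's actual code.

```python
import logging

LOG = logging.getLogger("analyser")  # hypothetical logger name


def log(msg: object, *, level: int = logging.INFO, rule_name: str = "unknown") -> None:
    """Log a message for a rule; `level` can only be passed by keyword.

    Because `level` is keyword-only, a call such as log(logging.FATAL, err)
    fails immediately with a clear TypeError at the call site instead of
    silently swapping the message and the level.
    """
    LOG.log(level, "Rule <%s> : %s", rule_name, msg)
```

The call in the error path would then read log(e, level=logging.FATAL, rule_name=...).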
The analyser should collect some statistics over the number of errors it finds in each run. In the long run this should be displayed in the front end but for now it would just be useful to have the data on the server. I'd like to be able to compare runs before and after changes to the Nominatim code.
My suggestion would be to simply log the data in a table in the database. That makes it easy to generate summaries as required. A simple table would do, with columns for the date, the name of the QA check and the number of errors. Maybe add an extra_data column of type JSONB, so we are future-proof for any additional data we might want to save later.
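A rough sketch of what such a table and one row per check could look like. The table and column names (other than extra_data) are invented for illustration, and the %s placeholders assume a psycopg-style PostgreSQL driver.

```python
import json
from datetime import date
from typing import Any, Dict, Optional, Tuple

# Hypothetical table layout: one row per QA check and run.
CREATE_STATS_TABLE = """
CREATE TABLE IF NOT EXISTS qa_run_statistics (
    run_date    DATE    NOT NULL,
    check_name  TEXT    NOT NULL,
    error_count INTEGER NOT NULL,
    extra_data  JSONB
)
"""

INSERT_STATS_ROW = (
    "INSERT INTO qa_run_statistics (run_date, check_name, error_count, extra_data) "
    "VALUES (%s, %s, %s, %s)"
)


def stats_row(
    check_name: str,
    error_count: int,
    run_date: Optional[date] = None,
    extra: Optional[Dict[str, Any]] = None,
) -> Tuple[date, str, int, Optional[str]]:
    """Build the parameter tuple for INSERT_STATS_ROW."""
    return (
        run_date or date.today(),
        check_name,
        error_count,
        json.dumps(extra) if extra is not None else None,
    )
```

Comparing two runs then comes down to a GROUP BY over check_name and run_date.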
waterway=boatyard is for "a place for constructing, repairing and storing vessels out of the water" but triggers the addr:* tags on non-addressable places check.
Although I suspect this tag is inappropriate in my particular case, I see no reason a boatyard couldn't have an address.
This layer has thousands of errors in Australia. Most of them seem to be places that have been tagged place=farm and represent a single farming property, which would also have a street address. The same problem is probably happening with place=isolated_dwelling and place=plot.
Example: https://www.openstreetmap.org/way/497029571
https://nominatim.org/qa/#map=16.90/39.10/-108.42&layer=addr_street_wrong_name
In this instance, the address is wrongly detected as belonging to 34 1/4 Road, when it properly belongs with 34 Road. The building with the address has a service road (highway=service, service=driveway) leading from the proper road to the building.
Getting a warning here about two place nodes named "Bruchmühlen" being close to each other.
https://nominatim.org/qa/#map=12.95/52.21/8.43&layer=place_nodes_close
This is a false positive as the village is (for reasons I never really fully understood) split by a state border, the southern and western part belonging to Lower Saxony and the north-east part to North Rhine-Westphalia.
So the two are indeed separate administrative entities known by the same name, but they have different ZIP codes (32289 vs. 49328), are in different states and districts, and have different car license plate letters ("HF" vs. "OS").
It often indicates address problems, road data problems or both.
https://wiki.openstreetmap.org/wiki/Mr%C3%B3wki runs such a QA server for Poland and it was/is very useful.
pink: 150m - 1500m from road to address
yellow: over 1500m from road to address
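The colour thresholds above could be expressed as a small classification function. This is only a sketch: the function name is made up, and treating both ends of the 150m - 1500m range as inclusive is an assumption, since the boundaries are not spelled out.

```python
from typing import Optional


def distance_colour(distance_m: float) -> Optional[str]:
    """Map the road-to-address distance in metres to a marker colour.

    Returns None for distances below 150 m, which are not flagged.
    """
    if distance_m > 1500:
        return "yellow"
    if distance_m >= 150:
        return "pink"
    return None
```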
I'm getting a false positive here, saying that the node's street name "Jahnplatz" conflicts with the parent "Jahnplatz (U)".
The node is on the building's ground floor, above ground, while "Jahnplatz (U)" is actually underground at layer=-2 (and wrongly labeled as "railway=platform" and "highway=footpath"):
https://nominatim.org/qa/#map=21.03/52.02/8.53&layer=addr_street_wrong_name
Node is: https://www.openstreetmap.org/node/5139522754
Platform way is: https://www.openstreetmap.org/way/260631154
It's probably a rare situation to have overlapping streets at different levels, and in this specific case the tagging is also questionable, but it may make sense to prioritize streets on the same level as the object tagged with addr:street in this check?
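To make the suggestion concrete, here is a sketch of preferring candidate streets on the same layer and falling back to distance. It is not how Nominatim actually selects the parent street; the candidate shape (a dict with "tags" and "distance_m") and the function names are invented for illustration. A missing layer tag is treated as layer 0, which is the OSM default.

```python
from typing import Any, Dict, List, Optional


def parse_layer(tags: Dict[str, str]) -> int:
    """Read the OSM layer tag; missing or malformed values count as layer 0."""
    try:
        return int(tags.get("layer", "0"))
    except ValueError:
        return 0


def pick_parent_street(
    obj_tags: Dict[str, str],
    candidates: List[Dict[str, Any]],
) -> Optional[Dict[str, Any]]:
    """Prefer the candidate closest in layer to the object, then the nearest one."""
    if not candidates:
        return None
    obj_layer = parse_layer(obj_tags)
    return min(
        candidates,
        key=lambda c: (abs(parse_layer(c["tags"]) - obj_layer), c["distance_m"]),
    )
```

In the Jahnplatz case, the surface street would win over the platform at layer=-2 even though the platform is closer.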
Add some documentation for the clustering-vt to explain how it works.
When I am zoomed out and click the link for an error on node 5164325981, JOSM loads the wrong area. If I zoom in, I get the correct area when I click the link. I suspect coordinate rounding, as the area I am sent to is nearby but does not contain the error.
related: #17
the html data is there, the max-height: 500px is just too small to show it at the bottom
see: https://nominatim.org/qa/#map=19/52.53776/13.36440&layer=addr_street_wrong_name
for object https://www.openstreetmap.org/way/891386318
it is not immediately obvious which street or way is referenced as parent
Adding multithreading to the tool would considerably reduce the time needed to execute all the rules. As Python threads run concurrently, the real benefit will come while queries are being executed on the PostgreSQL server (maybe by using a connection pool) and while clustering-vt is being called. Clustering-vt and PostgreSQL queries are the most time-consuming operations when executing a rule, so threads would definitely improve the performance of the tool.
It would be interesting to check whether it is better to execute a whole rule in its own thread or to spawn threads locally when executing PostgreSQL queries and when calling clustering-vt.
In the first case, we should make sure some operations are thread-safe, for example when accessing the config or when writing/reading to files.
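The first option (one thread per rule) could be sketched with the standard library's ThreadPoolExecutor. Here execute_one is a hypothetical callable standing in for whatever runs a single rule, and the default worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Optional


def execute_rules(
    rule_names: Iterable[str],
    execute_one: Callable[[str], None],
    max_workers: int = 4,
) -> Dict[str, Optional[BaseException]]:
    """Run every rule in its own worker thread.

    Returns a mapping from rule name to the exception it raised,
    or None if the rule finished without error.
    """
    results: Dict[str, Optional[BaseException]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(execute_one, name): name for name in rule_names}
        for future in as_completed(futures):
            results[futures[future]] = future.exception()
    return results
```

A failing rule does not stop the others here. Anything shared between rules (config access, output files, database connections) would still need to be made thread-safe separately.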
I'm not sure if the directory was added by mistake. There is a build directory mentioned in the .gitignore file (near the end of the file).
https://nominatim.org/qa/#map=2.91/0.00/0.00 links https://github.com/AntoJvlt/Nominatim-Data-Analyser/issues that redirects to https://github.com/AntoJvlt/Nominatim-Data-Analyser/pulls
please come to the "Issues" section of the github repository to discuss this.
Linking to issues and mentioning them in a repo with issues disabled does not seem intentional.
The "Suspicious addr:street tag" layer is missing quite a few nodes that Nominatim assigns a street with a name different from the value of their addr:street tag, because the value isn't in the name field of any nearby highway objects.
Some example nodes
The street names are recorded in name:left or name:right of the highway objects (example), but I thought Nominatim ignores those tags? Since these nodes don't work correctly in Nominatim (it assigns the correct street but then shows the address as 21 Leith Walk, see here), it's odd that the QA tool doesn't flag them.
Osmoscope adds URL fragments with current map position and zoom whenever one moves the map, e.g. #map=17.991666666666667/6.8867/52.24745
This makes it easy to share the current view with other users.
It would be useful to have the date of the database, so we can see if the updates worked and the right data is shown. The date can be queried from the database with SELECT lastimportdate FROM import_status.
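Reading that date from Python could look roughly like this. The function name is made up, and conn is assumed to be any DB-API 2.0 connection (for example one opened with a PostgreSQL driver) to the Nominatim database.

```python
from typing import Any, Optional


def fetch_last_import_date(conn: Any) -> Optional[Any]:
    """Return lastimportdate from import_status, or None if the table is empty.

    `conn` is assumed to be a DB-API 2.0 connection.
    """
    cur = conn.cursor()
    try:
        cur.execute("SELECT lastimportdate FROM import_status")
        row = cur.fetchone()
    finally:
        cur.close()
    return row[0] if row else None
```

The value could then be written next to the generated layers so the front end can display it.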
https://nominatim.org/qa/#map=18.16/49.32/-123.14&layer=addr_street_wrong_name turns up a bunch of errors, for example one with this node
It claims "street_name: Park Royal S" and "parent_name: Park Royal South". Assuming that addr:street is the street_name, it is not set on the way, and the tagging has not changed recently.
addr:street was present on an enclosing way that was not in the area I initially downloaded.
The instructions make no mention of addr:street tags on different objects.
Rerunning the tile generation usually only adds new layers and leaves existing ones untouched. This is a good behaviour when you want to just regenerate a single layer. But there are some corner cases where you end up with bogus entries in the file:
WebPrefixPath, all entries now exist with the old and new prefix
Two possible solutions to the problem:
--execute-all
always create the layers.json file from scratch
The .gitignore file already contains node_modules but there's a folder full of modules in the clustering-vt subdirectory.
We generate about 3.5G of vector tiles at the moment which need to be updated daily.
One obvious update strategy would be to just overwrite the existing tiles with new ones. But this may leave 'ghost tiles' where data has gone and no new tile is generated. So old data needs to be deleted. Removing the entire data set before regenerating the new one is not an option because then the vector tiles would not be accessible while the new ones are generated. That means the new tiles need to be generated in a temporary location and then swapped in for the old ones. This is a workable solution for now but means that every day 3.5G of data needs to be deleted, which takes quite a bit of time because the directory consists of lots of small files.
It would be nice if the vector tile generator could support in-place updates that overwrite existing tiles and delete ghost tiles if needed. Bonus points if it also skips writing tiles whose content hasn't changed.
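A sketch of what such an in-place update could look like, under simplifying assumptions: the new tiles are available as a mapping from relative path to content, tiles use a .pbf extension, and everything fits in memory. None of this reflects how the generator currently writes its output.

```python
import os
from pathlib import Path
from typing import Dict, Tuple


def sync_tiles(tile_dir: Path, new_tiles: Dict[str, bytes]) -> Tuple[int, int, int]:
    """Update tile_dir in place; returns (written, unchanged, deleted).

    new_tiles maps a relative path such as '3/1/2.pbf' to the tile content.
    """
    written = unchanged = deleted = 0
    for rel, content in new_tiles.items():
        target = tile_dir / rel
        if target.is_file() and target.read_bytes() == content:
            unchanged += 1  # identical content: skip the write
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        tmp = target.with_name(target.name + ".tmp")
        tmp.write_bytes(content)
        os.replace(tmp, target)  # swap in the new tile in one step
        written += 1
    wanted = {tile_dir / rel for rel in new_tiles}
    for path in list(tile_dir.rglob("*.pbf")):
        if path not in wanted:
            path.unlink()  # ghost tile: no longer generated
            deleted += 1
    return written, unchanged, deleted
```

Tiles stay readable throughout, and only changed or vanished files are touched. Empty directories left behind by deleted tiles are not cleaned up in this sketch.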