
baturin / wikivoyage-listings


Data extracted from Wikivoyage, the free travel guide at http://wikivoyage.org. Leverage Wikivoyage listings on your smartphone, or in your own mashups.

Home Page: http://wvpoi.batalex.ru/

License: Other

Java 92.19% Shell 3.20% HTML 2.95% Batchfile 1.67%

wikivoyage-listings's People

Contributors

baturin, cafeina-software, dazyoung, imanc, kevin0x90, nicolas-raoul, nsaiisasidhar, olgfok, patrickwieth, prawda1, zstojanovic


wikivoyage-listings's Issues

Cache dumps

Rather than downloading, for instance, frwikivoyage/latest/frwikivoyage-latest-pages-articles.xml.bz2 every time, we should:

  1. Open the versions list page https://dumps.wikimedia.org/frwikivoyage/
  2. Select the latest version
  3. Download it as dump_fr_20151102.xml.bz2 rather than just dump.xml.bz2
  4. Reuse the dump next time if no new version is available.
  5. Delete the old version once a newer version has been downloaded.

This is more sustainable than implementing caching mechanisms outside of the tool. This ensures that other people who use the tool do not overload the poor donation-funded Wikimedia servers.
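The steps above could be sketched roughly as follows. This is only an illustration: the class and method names are hypothetical and not part of the tool, and fetching the version list from https://dumps.wikimedia.org/frwikivoyage/ is left out. It assumes the version directories are named as yyyyMMdd dates, as in the dump_fr_20151102.xml.bz2 example.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

/** Sketch of version-aware dump caching; every name here is hypothetical. */
final class DumpCacheSketch {

    /** Picks the newest version among directory names such as "20151102"; entries like "latest" are ignored. */
    static Optional<String> latestVersion(List<String> versionDirectories) {
        return versionDirectories.stream()
                .filter(v -> v.matches("\\d{8}"))
                .max(String::compareTo); // yyyyMMdd strings sort chronologically
    }

    /** Builds the versioned cache file name, for example dump_fr_20151102.xml.bz2. */
    static String cacheFileName(String language, String version) {
        return "dump_" + language + "_" + version + ".xml.bz2";
    }

    /** True when the cache already holds the latest version, so no download is needed. */
    static boolean canReuse(List<String> cachedFileNames, String language, String latestVersion) {
        return cachedFileNames.contains(cacheFileName(language, latestVersion));
    }

    /** Cached dumps of this language that belong to older versions and can be deleted. */
    static List<String> staleFiles(List<String> cachedFileNames, String language, String latestVersion) {
        String prefix = "dump_" + language + "_";
        String current = cacheFileName(language, latestVersion);
        return cachedFileNames.stream()
                .filter(name -> name.startsWith(prefix) && !name.equals(current))
                .collect(Collectors.toList());
    }
}
```

The caller would download only when canReuse returns false, and delete the files returned by staleFiles only after the new download has succeeded.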

In live API, include images

This request:
http://wvpoi.batalex.ru/api/get-listings/?language=en&positional_data=true&limit=200&format=geojson&min_latitude=41.97480631113838&max_latitude=42.582410848693954&min_longitude=-71.62948608398438&max_longitude=-70.06668090820312
does not contain any image parameter, despite the listings having images, for instance "JFK Presidential Library and Museum" has an image at https://en.wikivoyage.org/wiki/Boston/Dorchester

Including images would allow for more beautiful applications to be built on top of the API.

Output unrecognized listing arguments to a log file

Some widely-used parameters might escape our attention (example: lastedit for English or wifi in French).

If we wrote unrecognized parameters to a log file, we could detect such cases.

This log file could also be useful to Wikignomes, by the way.
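One possible shape for such a log, as a sketch only: the class below is hypothetical, and the set of known parameter names would have to come from the per-language parsing code. It counts each unrecognized name so that widely used ones stand out.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical collector for listing parameters that the parser does not recognize. */
final class UnknownParameterLog {
    private final Set<String> knownParameters;
    private final Map<String, Integer> unknownCounts = new TreeMap<>();

    UnknownParameterLog(Set<String> knownParameters) {
        this.knownParameters = knownParameters;
    }

    /** Records the parameter names of one listing; names are compared trimmed and lower-cased. */
    void record(Iterable<String> parameterNames) {
        for (String name : parameterNames) {
            String key = name.trim().toLowerCase(Locale.ROOT);
            if (!key.isEmpty() && !knownParameters.contains(key)) {
                unknownCounts.merge(key, 1, Integer::sum);
            }
        }
    }

    /** One "name TAB count" line per unknown parameter, sorted by name, ready to write to a log file. */
    String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> entry : unknownCounts.entrySet()) {
            sb.append(entry.getKey()).append('\t').append(entry.getValue()).append('\n');
        }
        return sb.toString();
    }
}
```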

Add Wikidata property

The English and French Wikivoyages have a wikidata="Q1234567" property.

As you can guess, this property will create many new opportunities to compare data and collaborate between languages.

Filter out invalid data

People will be more willing to reuse our data if it is guaranteed valid.

For instance, a malformed latitude makes my GPS app refuse the whole file.
Unfortunately, there are always some malformed latitudes in the wikitext, and there always will be.

I think we should prevent invalid data from being used in the output format.

How do you think that could be implemented?
My current idea: after generating the list of WikivoyagePOI objects, generate the validation report, then use the existing validators to filter broken strings out of the list of WikivoyagePOI objects, and pass the filtered list to the other output classes.
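As an illustration of the filtering step for coordinates, here is a minimal sketch. It does not use the project's existing validators; the class and method names are hypothetical. It accepts only plain decimal notation within range and blanks anything else, so the rest of the listing can still be exported.

```java
import java.util.regex.Pattern;

/** Hypothetical coordinate check; blanks invalid values instead of dropping the whole listing. */
final class CoordinateCheckSketch {
    // Plain decimal notation only: rejects "41,97", "41.5d", "1e2", "N 41.5" and similar.
    private static final Pattern DECIMAL = Pattern.compile("-?\\d{1,3}(\\.\\d+)?");

    static boolean isValidLatitude(String raw) {
        return isDecimalWithin(raw, 90.0);
    }

    static boolean isValidLongitude(String raw) {
        return isDecimalWithin(raw, 180.0);
    }

    /** Returns the trimmed value when valid, otherwise an empty string. */
    static String keepIfValidLatitude(String raw) {
        return isValidLatitude(raw) ? raw.trim() : "";
    }

    /** Returns the trimmed value when valid, otherwise an empty string. */
    static String keepIfValidLongitude(String raw) {
        return isValidLongitude(raw) ? raw.trim() : "";
    }

    private static boolean isDecimalWithin(String raw, double limit) {
        if (raw == null) {
            return false;
        }
        String value = raw.trim();
        if (!DECIMAL.matcher(value).matches()) {
            return false;
        }
        return Math.abs(Double.parseDouble(value)) <= limit;
    }
}
```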

wvpoi.batalex.ru downloads out-of-date

Are you too busy to maintain the server maybe?
Wikivoyage is now getting a lot of Wikidata identifiers, so the data is getting more and more interesting for people :-)
Cheers!

Include diplomatic representation's operator, and whether it is an embassy or consulate

This is phase 2 following Dario's great work at #23.

The operator of the USA consulate in Frankfurt is the USA.

In the English Wikivoyage it is in the "Embassies" section, so we know that it is a consulate: https://en.wikivoyage.org/wiki/Frankfurt#Cope

We should add an "operator" property to Listing.java and accordingly a column in the CSV.
Also, when we know whether a diplomatic-representation is an embassy or a consulate, we should set the listing type to "embassy" or "consulate". When we don't know, "diplomatic-representation" is fine.

Other examples: https://en.wikivoyage.org/wiki/Lyon https://en.wikivoyage.org/wiki/Paris

Some Wikivoyages seem to not make the difference: https://fr.wikivoyage.org/wiki/Lyon#Repr.C3.A9sentations_diplomatiques

Remove invalid Wikidata identifier from output

The CSV/SQL/etc files should not contain invalid QIDs.

Redirects might be considered tolerable, but I would say it is better to remove them too, as there are not many and they might break some weakly-built tools.
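A purely syntactic check could serve as a first filter. The sketch below is hypothetical and only tests the shape of the identifier (a "Q" followed by a positive integer). It cannot tell whether the item exists or is a redirect, which would require a lookup against Wikidata.

```java
import java.util.regex.Pattern;

/** Hypothetical syntax-only QID check; existence and redirects need a Wikidata lookup. */
final class QidSyntaxSketch {
    private static final Pattern QID = Pattern.compile("Q[1-9][0-9]*");

    static boolean isWellFormed(String value) {
        return value != null && QID.matcher(value).matches();
    }

    /** Returns the value when well formed, otherwise an empty string for the output files. */
    static String keepIfWellFormed(String value) {
        return isWellFormed(value) ? value : "";
    }
}
```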

Add Wikidata QID of destination article

The column could be called "destination_qid" or similar.
Example: For the restaurant "Chez Robert" in the Nouméa article, the destination_qid value would be the QID of Nouméa.

That does not cost us much, and that solves huge headaches for people trying to use the data, especially when doing pivots using several language versions.

French wikivoyage input

I am trying to add French wikivoyage input.
The syntax is quite different from Russian/English:

* {{Voir
| nom= | alt= | url= | email= | wikipédia= | wikidata= | facebook= 
| adresse= | latitude= | longitude= | direction= 
| image = 
| téléphone= | numéro gratuit= | téléphone portable= | fax= 
| prix= | horaire= 
| wifi= | handicap= | mise à jour=2015/11/16
| description=
}}

The listing types are:

aller
circuler
voir
faire
acheter
manger
sortir
se loger
ville
destination
représentation diplomatique
autres

I will try to match them to English types.

I can try to guess the language in some cases, but not when using -input-file... what do you suggest? Should we add a language command-line parameter?

More validations

I am experimenting with data.world and it gave some interesting ideas for validation:

16 duplicate rows detected
    Row 1630 is a duplicate of the one above it.
    Row 29643 is a duplicate of the one above it.
    Row 65945 is a duplicate of the one above it.
    Row 77711 is a duplicate of the one above it.

+ 12 similar issues
158 blank cells detected

    Cell at row 1345, column title appears blank.
    Cell at row 1457, column title appears blank.
    Cell at row 1949, column title appears blank.
    Cell at row 2214, column title appears blank.

+ 154 similar issues
Numeric(114)
114 numeric values outside standard deviation detected

    Value 1.1102015E7 at row 66099, column checkin is more than 4 standard deviations of 646327.73 from the mean of 45586.41.
    Value 1.4102015E7 at row 66099, column checkout is more than 4 standard deviations of 778967.24 from the mean of 50554.01.
    Value -1800.0 at row 109847, column hours is more than 4 standard deviations of 248.63 from the mean of 8.14.
    Value 1130.0 at row 171909, column hours is more than 4 standard deviations of 248.63 from the mean of 8.14.

+ 110 similar issues
Noise(146)
2 non-numeric characters in number field detected

    Value at row 110516, column latitude does not appear to be numeric, but column is numeric.
    Value at row 110516, column longitude does not appear to be numeric, but column is numeric.

1 possible placeholder number detected

    Value 5555 at row 185359, column title appears to be a placeholder.

143 possible placeholder text values detected

    Value VVV at row 5521, column alt appears to be a placeholder.
    Value *** at row 12105, column alt appears to be a placeholder.
    Value CCC at row 39923, column alt appears to be a placeholder.
    Value *** at row 104112, column alt appears to be a placeholder.

+ 139 similar issues
Text(1,945)
1,945 text values outside standard deviation detected

    Text value 95 at row 508, column address has length more than 4 standard deviations away from the mean.
    Text value 88 at row 591, column address has length more than 4 standard deviations away from the mean.
    Text value 162 at row 599, column address has length more than 4 standard deviations away from the mean.
    Text value 144 at row 742, column address has length more than 4 standard deviations away from the mean.

+ 1,941 similar issues

Date(3)
3 dates detected far in the future

    Date is far in the future at column lastedit, row 77420.
    Date is far in the future at column lastedit, row 130787.
    Date is far in the future at column lastedit, row 205038.

Feel free to split this issue into separate issues, and implement only the ones you want.

For your information, here are the current validations: http://wvpoi.batalex.ru/download/listings/wikivoyage-listings-en-latest.validation-report.html

Remove "for OsmAnd" from project description

This project is now more than just for OsmAnd :-)

Current description is: "Parse listing from Wikivoyage to create data files for OsmAnd"

How about: "Data extracted from Wikivoyage, the free travel guide at http://wikivoyage.org. Leverage Wikivoyage listings on your smartphone, or in your own mashups."

Filter out HTML comments from parameters

Example:

*{{sleep
| name=AMBER HOUSE - at the centre! <!-- This is the actual, verbatim name of the B&B - see: http://en.wikivoyage.org/w/index.php?title=User_talk%3AIkan_Kekek&diff=2377723&oldid=2377566#Nelson --> | url=http://www.AmberHouse.co.nz | [email protected]
| address=46 Weka St | lat=-41.26677 | long=173.29322 | directions=When entering the city of Nelson from the SH6 roundabout, turn first right from Trafalgar St opposite Rugby Ground and then Wainui St becomes Weka St after 300m
| phone=+64 3 539-0605, +64 21 202 4961 (Mobile) | tollfree= | fax=+44 7005 963 437 (fax server in Northern Ireland) | image=Amber_House,_Nelson,_New Zealand,_2005-11-16T01-33Z.jpg
| hours= | price=$79-249
| checkin=by arrangement, usually after 14:00 | checkout=usually before 10:35
| content=Open all year, clean Bed and Breakfast in a lovely 1897 villa that used to be a school for girls and ''little'' boys. Traditional Rose Garden with the oldest walnut tree in the South Island hidden away at the back of the plot. One of the few that still offers a full cooked breakfast. Full board and room service available. Now has satellite HD TV, Wi-Fi, double glazing and air-conditioning. Bedrooms have ''en-suite'' showers. Quiet fringe of CBD location. Smoking or smokers not allowed(!) (''party of 4 from $37 each'') The Amber family first came to Nelson in 1842 but can understand some French, Fukien Chinese, German, Malay and Spanish.}}

Right now the title is the whole AMBER HOUSE - at the centre! <!-- This is the actual, verbatim name of the B&B - see: http://en.wikivoyage.org/w/index.php?title=User_talk%3AIkan_Kekek&diff=2377723&oldid=2377566#Nelson -->, whereas it should be only AMBER HOUSE - at the centre!

Other example:

* {{listing
| name=[[Tombstone|Tombstone]], [[Arizona]]<!-- note: the piped syntax is necessary for the dynamic map to show a correct link -->
| directions=
| lat=31.72247 | long=-110.07726
| content=A legendary Western town.
}}
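A minimal sketch of the stripping step, assuming it is applied to each parameter value after parsing. The class name is hypothetical. If a comment may contain a pipe character, stripping before the parameters are split would be safer.

```java
import java.util.regex.Pattern;

/** Hypothetical helper that removes HTML comments from a listing parameter value. */
final class HtmlCommentStripper {
    // Reluctant match so that two comments in one value are removed separately;
    // DOTALL lets a comment span several lines.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    static String strip(String value) {
        return COMMENT.matcher(value).replaceAll("").trim();
    }
}
```

With this, the name in the first example becomes AMBER HOUSE - at the centre! and the name in the second becomes [[Tombstone|Tombstone]], [[Arizona]]. An unterminated comment is left untouched.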

Export to KML

GPX is not supported by Maps.me, Google Earth and other apps

Invalid CSV produced

This line produces invalid CSV.
Yes, the line is problematic (several listings mistakenly bundled into one). But even when the content is broken (which always happens somewhere), we should never produce invalid CSV.

* {{buy | image=
|name=Ramstore Atrium Supermarket Trade Center |alt=Рамстор Атриум |url=http://www.ramstore.kz |email=|address=Ul. Raiymbek Nauryzbai Batyr (Райымбек даңғылы) |lat=43.26812 | long=76.93390|directions=Metro: Raiymbek batyr |phone=+8 727 244-6556|fax=+8 727 244-6530 |hours=Open: 09:00-23:00|price= |content=International retrail store chain. - 9 cashdesk with electronic scales; - Accept Moneycards as: Visa, Master Card, Euro Card, American Express, China Union Pay, Altyn. - More units: Samal Shopping center, Address: Furmanov str., 226 (Phone:+8 727 330-5501, Fax:+8 727 258-7570, Open: 9:00-24:00); - Tastak Supermarket, Address : Almaty, Tole Bi str., 229, (Phones:+8 727 2414008, Fax: 8 727 2414015, Open: 09:00-23:00). - Aynabulak Supermarket, Address: Almaty, micro-district Aynabulak, 98 B. (Phones: +8 727 299-4009, 2994069, Fax : +8 727 2994009, Open: 09:00-23:00). - Atakent Supermarket, Address: Almaty, Temiryazev str., 42. (Phones:+8 727 275-6833, +8 727 275-6835, 275-6838, Fax:+8 727 275-7289, Open: 09:00-23:00). - Mega Supermarket, Address: Almaty, Rozibakiyev str., 247 А. (Phones: +8 727 232-2612, +8 727 232-2614, +8 727 271-9740, Fax:+8 727 232-2613, Open: 09:00-23:00). - Hyper Aport Hypermarket, Address : Moll А’port, Karasay district (200 m from the Market Altyn-Orda, Six km west of the City Center. Phone:+8 727 312-15-61, Fax:+8 727 312-1563, Open: 09:00-23:00). - Mango Supermarket, Address: Almaty, st. Sholohova/Seyfullina 29. (Phone:+8 727 313-7522, 313-7520, Fax:+8 727 313-7522, Open: 09:00-23:00). - Globus Supermarket, Address: Almaty, st. Abaya/Auezova 109 b. (Phone:+8 727 356-7564, Fax :+8 727 356-7562, Open: 09:00-23:00. - Almagul Supermarket, Address: Almaty, microdistrict Almagul 18a. (Phone:+8 727 396-2507. Fax :+8 727 396-2508, Open: 09:00-23:00). - {{buy | lat=43.2730 | long=76.9384 }} Hyper Altyn Taraz Hypermarket /Рамстор Алтын Тараз/, Address: Str. Abylai Han 3 Moll - Altin-Taraz , Желтоқсан көшесі (Near to Astan - 2 Station). 
Phone:+8 727 244-6102, Fax :+8 727 244-6106, Open: 09:00-23:00).- Sputnik Supermarket, mkr-n Mamyr-1, 8a Moll Sputnik. (Phone:+8 727 244-7519, Open: 09:00-23:00). - Shemyakina Supermarket, Address: Shemyakina str. 121, (Phone:+8 727 303-4048, Open: 09:00-23:00). - Timiryazeva Supermarket, Address: Almaty, Str. Timiryazeva 37, (Phones: +8 727 248-4631, 248-4633, 248-4635, Open: 09:00-23:00).
}}
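The exact cause is not identified here, but the content above contains commas and a line break, both of which need quoting in CSV. As a sketch under that assumption, RFC 4180 style escaping keeps the file valid whatever the field contains. The class below is hypothetical and is not the project's CSV.java.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical RFC 4180 style escaping, so that broken wikitext cannot break the CSV structure. */
final class CsvEscapeSketch {

    /** Quotes the field when it contains a comma, a quote or a line break; inner quotes are doubled. */
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.indexOf(',') >= 0
                || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Joins the escaped fields into one CSV record, without the trailing line break. */
    static String row(List<String> fields) {
        return fields.stream().map(CsvEscapeSketch::escape).collect(Collectors.joining(","));
    }
}
```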

up-for-grabs.net

Hi, I don't know if this is the correct place to write but I want to get involved in open source and I found this project on up-for-grabs.net. Any help needed?

Parse diplomatic representations (embassies and consulates)

They are listed like this in the English Wikivoyage:

* {{flag|Afghanistan}} {{listing
| name=Afghanistan | url=http://embassyofafghanistan.org/ | email=
| address=2341 Wyoming Ave NW | lat= | long= | directions=
| phone=+1 202 234-3770 | tollfree= | fax=
| hours= | price=
| content=
}}

It would be a new type of listing, called maybe "diplomatic-representation".

Only use complete dumps

The folder on the server is created before all files are ready. Until ready it is called a "Partial dump".

screenshot from 2016-08-01 17-59-11

This makes the tool fail if we are unlucky enough to run at the wrong time.

Filter out "{{dead link" from validation report

Most of the validation report is lines like this:

Invalid URL 'http://www.stockholmpiecompany.com {{dead link|May 2016}}'

There are tools within the Wikivoyage website to deal with these, so better filter them out to focus on issues that are not easily found on the website.
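A sketch of how such values could be recognized before the URL validator runs. The class name is hypothetical, and the pattern only assumes the {{dead link|...}} form shown in the example above.

```java
import java.util.regex.Pattern;

/** Hypothetical helper that detects and removes a {{dead link|...}} marker from a URL value. */
final class DeadLinkFilterSketch {
    private static final Pattern DEAD_LINK =
            Pattern.compile("\\{\\{\\s*dead link\\b[^}]*\\}\\}", Pattern.CASE_INSENSITIVE);

    /** True when the value carries a dead-link marker and can be skipped in the report. */
    static boolean hasDeadLinkMarker(String url) {
        return DEAD_LINK.matcher(url).find();
    }

    /** The URL without the marker, in case the remaining part should still be validated. */
    static String withoutMarker(String url) {
        return DEAD_LINK.matcher(url).replaceAll("").trim();
    }
}
```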

Ideas for future validations

A few ideas for more validation.
These are low urgency, so there is no need to integrate them into ValidationReport.java now; this can serve as a memo for one-time SQL queries for the time being.

Global checks

Find listings with a wikipedia value but no wikidata value:

SELECT article, title FROM wikivoyage_listings WHERE wikipedia != '' AND wikidata = '';

Find listings with duplicate coordinates:

SELECT
    article, title, latitude, longitude, COUNT(*)
FROM
    wikivoyage_listings
GROUP BY
    latitude, longitude
HAVING 
    COUNT(*) > 1
ORDER BY
    COUNT(*) DESC;

Listings with same name in same article:

SELECT
    article, title, COUNT(*)
FROM
    wikivoyage_listings
WHERE
    title != ''
GROUP BY
    article, title
HAVING 
    COUNT(*) > 1
ORDER BY
    COUNT(*) DESC;

URL validation

?utm_source= is always unnecessary in URLs; it is a tracking code.
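A check for this could be as small as the following sketch. The class name is hypothetical, and it only looks for the utm_source parameter mentioned above.

```java
import java.util.regex.Pattern;

/** Hypothetical check for a utm_source tracking parameter in a URL. */
final class TrackingParameterCheck {
    // The parameter starts right after '?' or '&' in the query string.
    private static final Pattern UTM_SOURCE =
            Pattern.compile("[?&]utm_source=", Pattern.CASE_INSENSITIVE);

    static boolean hasUtmSource(String url) {
        return url != null && UTM_SOURCE.matcher(url).find();
    }
}
```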

JSON OutputFormat

JSON is the easiest format for creating webapps and mashups.

Having a JSON (not compressed) file hosted on a server would allow people to easily write interesting applications.

The size might be too big if all details are included, though. Maybe just coordinates + article + title?

Run templates in parameters

The French wikivoyage uses a template for prices, for instance: {{Prix|3.2|€}}

Right now they are ignored, which means {{Prix|3.2|€}} becomes an empty string.

Running the actual templates code sounds difficult, so we could try to re-implement them in Java.

Any better idea?
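As a sketch of the re-implementation idea, the following hypothetical class expands the two-argument form shown above into "amount currency". The real template on the French Wikivoyage may format its output differently (separators, spacing, extra arguments), so the output format here is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical Java re-implementation of the two-argument {{Prix|amount|currency}} form. */
final class PrixTemplateSketch {
    private static final Pattern PRIX = Pattern.compile(
            "\\{\\{\\s*Prix\\s*\\|([^|}]*)\\|([^|}]*)\\}\\}", Pattern.CASE_INSENSITIVE);

    /** Replaces each matching template call with "amount currency"; other text is kept as is. */
    static String expand(String text) {
        Matcher matcher = PRIX.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String replacement = matcher.group(1).trim() + " " + matcher.group(2).trim();
            // quoteReplacement keeps currency signs such as '$' from being read as group references
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```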

Multiple calls to same getter, maybe use a variable

poi.getUrl() is called 3 times at L14 and 15, and the getter contains an if, so it is not a simple property fetch:

if (poi.getUrl() != null && !poi.getUrl().equals("")) {

Just a minor gripe. I didn't find a discussion place for this, and I'm pretty sure this won't be high priority since optimization will come late in the timeline, so feel free to delete this issue.
Other files have this, too:

if (poi.getEmail() != null && !poi.getEmail().equals("")) {

and Line 9
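The suggested change amounts to storing the getter result in a local variable. A minimal sketch follows, with a hypothetical Poi interface standing in for the project's POI class.

```java
/** Sketch of calling a getter once and reusing the result. */
final class GetterCachingSketch {

    /** Hypothetical stand-in for the project's POI class. */
    interface Poi {
        String getUrl();
    }

    /** Calls getUrl() exactly once, whatever logic the getter contains. */
    static boolean hasUrl(Poi poi) {
        String url = poi.getUrl();
        return url != null && !url.isEmpty();
    }
}
```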

Generic XML output, with all parameters

Some POIs have links to the website of the venue associated with their title, but the website link is omitted in the XML. Would it be possible to change that in the conversion tool?

Cheers & Thanks!

Aborting during download/extract leaves a broken file

Steps:

  1. Execute the script
  2. Abort (for instance with CTRL-C)
  3. Execute again
  4. Error appears:
[2018-02-27 10:53:14] Use cached dump                                             
[2018-02-27 10:53:14] Parse dump                                                   
[2018-02-27 10:53:14] Save to '../wikivoyage.github.io/wikivoyage-listings-fr.csv'
Failure                                                                                                  
org.wikivoyage.listings.input.DumpReadException: Failed to get article in Wikivoyage dump: error when reading XML
        at org.wikivoyage.listings.input.DumpArticlesIterator.getNext(DumpArticlesIterator.java:189)
        at org.wikivoyage.listings.input.DumpArticlesIterator.next(DumpArticlesIterator.java:59)
        at org.wikivoyage.listings.input.DumpListingsIterator.getNext(DumpListingsIterator.java:40)
        at org.wikivoyage.listings.input.DumpListingsIterator.next(DumpListingsIterator.java:57)
        at org.wikivoyage.listings.input.DumpListingsIterator.next(DumpListingsIterator.java:17)
        at org.wikivoyage.listings.validators.SimpleValidator$SimpleValidatorIterator.next(SimpleValidator.java:41)
        at org.wikivoyage.listings.validators.SimpleValidator$SimpleValidatorIterator.next(SimpleValidator.java:27)
        at org.wikivoyage.listings.validators.SimpleValidator$SimpleValidatorIterator.next(SimpleValidator.java:41)
        at org.wikivoyage.listings.validators.SimpleValidator$SimpleValidatorIterator.next(SimpleValidator.java:27)
        at org.wikivoyage.listings.validators.SimpleValidator$SimpleValidatorIterator.next(SimpleValidator.java:41)
        at org.wikivoyage.listings.validators.SimpleValidator$SimpleValidatorIterator.next(SimpleValidator.java:27)
        at org.wikivoyage.listings.validators.SimpleValidator$SimpleValidatorIterator.next(SimpleValidator.java:41)
        at org.wikivoyage.listings.validators.SimpleValidator$SimpleValidatorIterator.next(SimpleValidator.java:27)
        at org.wikivoyage.listings.validators.WikidataValidator$WikidataValidatorIterator.validateNextBatch(WikidataValidator.java:60)
        at org.wikivoyage.listings.validators.WikidataValidator$WikidataValidatorIterator.next(WikidataValidator.java:51)
        at org.wikivoyage.listings.validators.WikidataValidator$WikidataValidatorIterator.next(WikidataValidator.java:35)
        at org.wikivoyage.listings.output.CSV.write(CSV.java:60)
        at org.wikivoyage.listings.Main.generateFileForFormat(Main.java:235)
        at org.wikivoyage.listings.Main.main(Main.java:98)
Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[197812,34]
Message: unexpected end of stream
        at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:599)
        at org.wikivoyage.listings.input.DumpArticlesIterator.getNext(DumpArticlesIterator.java:183)
        ... 18 more

Workaround: Remove all files in the dumps-cache folder.

The tool should download/extract under a temporary name, and only give the final filename after the extraction finishes (successfully).
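A sketch of that approach using java.nio.file, with hypothetical class and method names and a ".part" suffix chosen for illustration. The data is written next to the target under a temporary name and only renamed once the copy has completed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: write under a temporary name, rename only after success. */
final class AtomicDownloadSketch {

    static void saveAtomically(InputStream source, Path target) throws IOException {
        // Same directory as the target, so the final rename stays on one file system.
        Path temp = target.resolveSibling(target.getFileName() + ".part");
        try {
            Files.copy(source, temp, StandardCopyOption.REPLACE_EXISTING);
            // Whether an existing target is replaced by an atomic move is platform dependent.
            Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // No-op after a successful move; removes the partial file after a failed copy.
            Files.deleteIfExists(temp);
        }
    }
}
```

If the process is killed with CTRL-C, the finally block may not run and a .part file can remain, but it is never mistaken for a complete dump because the final filename is only assigned by the move.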

Modify getFlagElement() method

The flag template has a different name in each language, such as Template: Bandera in Spanish and Template: Drapeaux in French, so the getFlagElement() method needs to be changed accordingly.

Correct me if I'm wrong.

Add Italian

The Italian Wikivoyage has expressed interest in getting parsed by us.

I believe it will not be very difficult, copying English.java or German.java will probably work. There might be a few things to adjust.

The first thing to do would be to find a listing on the Italian Wikivoyage and compare it to listings on other Wikivoyages to see which one is the closest.

Add listing image to the Wikidata item if it does not have one yet

It would be cool if we started pouring listings information into Wikidata.
That would make Wikivoyage's info even more reusable than our CSV/etc dumps, with the added benefit of being real-time.
In a few centuries listings could even be entirely defined on Wikidata, who knows?

Map of validation report results

The report results are grouped by article (which is great for efficient editing), but in addition I think the articles should be either:

  • Organized on a map, so that people can process errors in places they are familiar with.
  • Or randomized, so that people don't always process the same articles, and to prevent potential false-positives from making up most of the beginning of the list.
