macbre / wbc.macbre.net

Alternative WBC archive web interface

Home Page: http://wbc.macbre.net

License: MIT License

Makefile 2.32% Shell 0.17% Python 70.67% HTML 10.03% CSS 10.28% JavaScript 4.67% Dockerfile 1.86%
wbc-archive wbc archives ngnix python docker-compose sphinx-search poznan archive djvu

wbc.macbre.net's Introduction

wbc.macbre.net

WBC archive served via an API and as a web front-end application.

Architecture

Docker Compose running the following:

  • Manticore Search instance (this is a fork of Sphinx)
  • Flask-powered app providing HTTP API

macbre/wbc can fetch DjVu files and convert them to an XML format that can be indexed by SphinxSE.

Development

Run the following:

docker-compose up -d sphinx
cd app && virtualenv env -p python3.8 && source env/bin/activate && pip install -e . && ./server_debug.sh

The local instance of wbc.macbre.net should be ready at http://0.0.0.0:8080/

API

All endpoints below need to be prefixed with /api/v1 (e.g. /api/v1/search?q=foo).

Publications

GET /publications

List of all publications

GET /publications/{id}

Metadata of a given publication

Issues

GET /issues/{id}

Get all documents in a given issue

Documents

GET /documents/{id}

Get a given document

GET /documents/{id}.txt

Get a given document as a plain-text file

Search

GET /search?q={query}

Search within all publications

Suggest

GET /suggest?q={query}

Return search suggestions
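
A minimal sketch of how the endpoints above map to full URLs. It assumes the public host from the home page link; the helper name `api_url` is hypothetical and not part of the project.

```python
from urllib.parse import quote, urlencode

# Assumption: the public instance. A local instance would use http://0.0.0.0:8080.
BASE_URL = "http://wbc.macbre.net"
API_PREFIX = "/api/v1"


def api_url(path, **params):
    """Build a full API URL for an endpoint path such as 'documents/5978.txt'."""
    url = BASE_URL + API_PREFIX + "/" + quote(path.lstrip("/"))
    if params:
        # urlencode percent-encodes non-ASCII query values as UTF-8
        url += "?" + urlencode(params)
    return url


print(api_url("publications"))
print(api_url("documents/5978.txt"))
print(api_url("search", q="foo"))
```

The resulting URLs can be passed to any HTTP client; the sketch only builds them and sends no requests.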

schema.org

Certificate renewal

acme.sh --issue -d wbc.macbre.net  --stateless --force

Content indexing

  • get XML content from http://s3.macbre.net/wbc/kronika_gazeta_wielkiego_ksiestwa.xml.gz (indexed by macbre/wbc)
  • run make index to index the XML file in Sphinx:
using config file '/opt/sphinx/conf/sphinx.conf'...
indexing index 'wbc'...
collected 11980 docs, 246.9 MB
sorted 35.1 Mhits, 100.0% done
total 11980 docs, 246858497 bytes
total 318.765 sec, 774419 bytes/sec, 37.58 docs/sec
total 97 reads, 1.865 sec, 2095.4 kb/call avg, 19.2 msec/call avg
total 1650 writes, 0.733 sec, 390.8 kb/call avg, 0.4 msec/call avg

wbc.macbre.net's Issues

Google Sign-In

<script src="https://apis.google.com/js/platform.js" async defer></script>

<meta name="google-signin-scope" content="profile email">
<meta name="google-signin-client_id" content="YOUR_CLIENT_ID.apps.googleusercontent.com">

<div class="g-signin2" data-onsuccess="onSignIn" data-width="300" data-height="200" data-longtitle="true"></div>

<script>
function onSignIn(googleUser) {
  var profile = googleUser.getBasicProfile();
  console.log('ID: ' + profile.getId()); // Do not send to your backend! Use an ID token instead.
  console.log('Name: ' + profile.getName());
  console.log('Image URL: ' + profile.getImageUrl());
  console.log('Email: ' + profile.getEmail());
}
</script>

<a href="#" onclick="signOut();">Sign out</a>
<script>
  function signOut() {
    var auth2 = gapi.auth2.getAuthInstance();
    auth2.signOut().then(function () {
      console.log('User signed out.');
    });
  }
</script>
# Server side (Python): verify the ID token received from the browser
from oauth2client import client, crypt

# (Receive token by HTTPS POST)

try:
    idinfo = client.verify_id_token(token, CLIENT_ID)

    # Or, if multiple clients access the backend server:
    #idinfo = client.verify_id_token(token, None)
    #if idinfo['aud'] not in [CLIENT_ID_1, CLIENT_ID_2, CLIENT_ID_3]:
    #    raise crypt.AppIdentityError("Unrecognized client.")

    if idinfo['iss'] not in ['accounts.google.com', 'https://accounts.google.com']:
        raise crypt.AppIdentityError("Wrong issuer.")

    # If auth request is from a G Suite domain:
    #if idinfo['hd'] != GSUITE_DOMAIN_NAME:
    #    raise crypt.AppIdentityError("Wrong hosted domain.")
except crypt.AppIdentityError:
    # Invalid token: treat the user as not signed in
    userid = None
else:
    # Valid token: 'sub' holds the user's unique Google ID
    userid = idinfo['sub']

Bad HTTP/0.9 request type - put the app behind nginx

Aug 11 21:50:55 bjornoya app[529]: ERROR:werkzeug:109.104.203.62 - - [11/Aug/2016 19:50:55] code 400, message Bad HTTP/0.9 request type ('\x03\x00\x00)$à\x00\x00\x00\x00\x00Cookie:')
Aug 11 21:53:45 bjornoya app[529]: INFO:werkzeug:37.26.128.179 - - [11/Aug/2016 19:53:45] "CONNECT hideface.tk:80 HTTP/1.1" 404 -
Aug 11 21:53:45 bjornoya app[529]: INFO:werkzeug:37.26.128.179 - - [11/Aug/2016 19:53:45] "CONNECT 50na50.net:80 HTTP/1.1" 404 -
Aug 11 21:53:45 bjornoya app[529]: ERROR:werkzeug:109.104.203.62 - - [11/Aug/2016 19:53:45] code 400, message Bad HTTP/0.9 request type ('\x03\x00\x00)$à\x00\x00\x00\x00\x00Cookie:')
Aug 11 21:53:45 bjornoya app[529]: INFO:werkzeug:109.104.203.62 - - [11/Aug/2016 19:53:45] "#003
Aug 11 21:53:45 bjornoya app[529]: INFO:werkzeug:37.26.128.179 - - [11/Aug/2016 19:53:45] "GET http://b1.50na50.net/myiphc.php?rnd=320c39084a61ab518d773518747bdf5e&rn=147094523070248 HTTP/1.1" 404 -

Redis: search suggestions

$ redis-cli 
127.0.0.1:6379> ZADD autocomplete 0 poznańskiego 0 poznań 0 tej 0 tak
(integer) 4

127.0.0.1:6379> ZRANGEBYLEX autocomplete [poz "[poz\xff" LIMIT 0 10
1) "pozna\xc5\x84"
2) "pozna\xc5\x84skiego"
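
The prefix lookup above can be imitated without a Redis server. This is only a sketch of the idea behind the `ZRANGEBYLEX` range `[poz` .. `[poz\xff`: a binary search over byte-ordered members. It is not how Redis implements the command, and the function name `prefix_range` is hypothetical.

```python
from bisect import bisect_left, bisect_right


def prefix_range(sorted_members, prefix, limit=10):
    """Return up to `limit` members that start with `prefix`, in byte order."""
    start = bisect_left(sorted_members, prefix)
    # 0xff never occurs in valid UTF-8, so prefix + b"\xff" sorts after
    # every UTF-8 member that begins with the prefix.
    end = bisect_right(sorted_members, prefix + b"\xff")
    return sorted_members[start:end][:limit]


words = ["poznańskiego", "poznań", "tej", "tak"]
# Members with equal scores are ordered by their raw bytes, as in the session above.
members = sorted(word.encode("utf-8") for word in words)

matches = [m.decode("utf-8") for m in prefix_range(members, "poz".encode("utf-8"))]
print(matches)
```

Comparing UTF-8 bytes rather than Python strings keeps the ordering consistent with the escaped output shown in the redis-cli session.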

http://www.opensearch.org/Specifications/OpenSearch/Extensions/Suggestions/1.1

["sea",
  ["sears",
   "search engines",
   "search engine",
   "search",
   "sears.com",
   "seattle times"],
  ["7,390,000 results",
   "17,900,000 results",
   "25,700,000 results",
   "1,220,000,000 results",
   "1 result",
   "17,600,000 results"],
  ["http://example.com?q=sears",
   "http://example.com?q=search+engines",
   "http://example.com?q=search+engine",
   "http://example.com?q=search",
   "http://example.com?q=sears.com",
   "http://example.com?q=seattle+times"]]

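The four-element array above (query, completions, descriptions, URLs) can be assembled with a few lines of Python. This is a sketch under assumptions: the function name `opensearch_suggestions` and the `search_url` parameter are hypothetical, and descriptions are left empty because no result counts are available here.

```python
import json
from urllib.parse import urlencode


def opensearch_suggestions(query, completions, search_url):
    """Build the [query, completions, descriptions, urls] suggestions array."""
    descriptions = ["" for _ in completions]  # placeholder: no result counts here
    urls = [search_url + "?" + urlencode({"q": c}) for c in completions]
    return [query, list(completions), descriptions, urls]


payload = opensearch_suggestions(
    "sea", ["sears", "search engines"], "http://example.com"
)
# ensure_ascii=False keeps Polish characters readable in the JSON output
print(json.dumps(payload, ensure_ascii=False, indent=2))
```
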
Footnotes

http://wbc.macbre.net/document/6947/parafia-ewangelicka-w-staroece-ikrzesinach-1907-1945.html

...łamanie oporu ewangelickich osadników tej miejscowości z niechęcią przyjmujących perspektywę wcielenia do parafii w Starołęce 13 .

...

13 APB KEP sygn. 5097, n.p.: Superintendent Staemmler an das Kgl. Konsistorium, Posen, 15.7.1907.
<sup id="cite_1" class="reference"><a href="#cite_note-1">[1]</a></sup>

...

<ol class="references">
  <li id="cite_note-1"><sup><a href="#cite_1">1</a></sup>
    <span class="reference-text">Parafia ewangelicka w Starołęce i Krzesinach (1907-1945) w: <a rel="nofollow" class="external text exitstitial" href="http://www.wbc.poznan.pl/dlibra/docmetadata?id=161589">„Kronika Miasta Poznania” nr 4/2009, „Starołęka, Głuszyna, Krzesiny”</a>, Poznań, Wydawnictwo Miejskie, 2009.</span>
  </li>
</ol>
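
One way to produce markup of this shape is sketched below. The function names `render_marker` and `render_references` are hypothetical, and the sketch assumes footnote texts have already been extracted from the document as plain strings, which is the hard part and is not covered here.

```python
from html import escape


def render_marker(number):
    """Inline marker that links to the matching entry in the references list."""
    return (
        '<sup id="cite_{n}" class="reference">'
        '<a href="#cite_note-{n}">[{n}]</a></sup>'
    ).format(n=number)


def render_references(notes):
    """Ordered list of footnotes; each entry links back to its inline marker."""
    items = []
    for number, text in enumerate(notes, start=1):
        items.append(
            '  <li id="cite_note-{n}"><sup><a href="#cite_{n}">{n}</a></sup>\n'
            '    <span class="reference-text">{text}</span>\n'
            "  </li>".format(n=number, text=escape(text))
        )
    return '<ol class="references">\n' + "\n".join(items) + "\n</ol>"


print(render_marker(1))
print(render_references(["Kieniewicz, str. 111."]))
```

The marker's `href` and the list item's `id` use the same `cite_note-N` value, so the link from the text to the footnote and the link back both resolve.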

http://wbc.macbre.net/document/5614/proces-w-marcu-1610-roku.html

http://wbc.macbre.net/document/3360/kronika-miasta-poznania.html

wołaniem wojska byŁoby oswobodzenie Was'zawy i przeniesipnie do niej rządu narodowego... W takich okoliczno w Poznańskiem uczyniono..."16).

PRZYPISY 1) Praca niniejsza jest drugim, nieco zmienionym wydaniem rozpra.wki pOdpisanego pt. "Rząd Narodowy w 1848''. ogłoszonej w R o c z n i k a c h H i s t o r y c z n y c h, Poznań 1935. str. 197 i n. Ogłoszone w ciągu lat osta,tnich nowe prace z tego okresu wymagaJy zajęcia stanowiska przez autora z powodu kontrowersji zdań. Historię polityczną powstania 1848 r. czeństwo polskie w powstaniu poznańskim 1848. Warszawa 1935. Korzystam z niej obficie, koncentrując uwagę na problemie rządu narodowego.  ... 16) Kieniewicz, str. 111.

https://wbc.macbre.net/document/6458/alkohol-jako-bron-taktyczna.html

Search suggest via SphinxSE + CALL SUGGEST

CALL QSUGGEST('automaticlly ','forum');
+---------------+----------+------+
| suggest       | distance | docs |
+---------------+----------+------+
| automatically | 1        | 282  |
| automaticly   | 1        | 6    |
| automaticaly  | 1        | 3    |
| automagically | 2        | 14   |
| automtically  | 2        | 1    |
+---------------+----------+------+
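
The `distance` column above matches plain Levenshtein edit distance for these words. The sketch below reproduces the table's ordering by sorting on distance and then on document count. It illustrates the ranking idea only and makes no claim about how the search daemon computes suggestions internally; the candidate list is copied from the table.

```python
def levenshtein(a, b):
    """Edit distance counting single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def rank(word, candidates):
    """Sort (candidate, docs) pairs by distance to `word`, then by docs descending."""
    scored = [(c, levenshtein(word, c), docs) for c, docs in candidates]
    return sorted(scored, key=lambda row: (row[1], -row[2]))


# Candidates and document counts copied from the table above.
candidates = [
    ("automtically", 1),
    ("automagically", 14),
    ("automaticaly", 3),
    ("automaticly", 6),
    ("automatically", 282),
]
for suggestion, distance, docs in rank("automaticlly", candidates):
    print(suggestion, distance, docs)
```
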

Instead of (introduced in #1):

$ redis-cli 
127.0.0.1:6379> ZADD autocomplete 0 poznańskiego 0 poznań 0 tej 0 tak
(integer) 4

127.0.0.1:6379> ZRANGEBYLEX autocomplete [poz "[poz\xff" LIMIT 0 10
1) "pozna\xc5\x84"
2) "pozna\xc5\x84skiego"

Current index status:

mysql> SHOW INDEX wbc STATUS;
+-------------------+-----------+
| Variable_name     | Value     |
+-------------------+-----------+
| index_type        | disk      |
| indexed_documents | 6968      |
| indexed_bytes     | 122375401 |
| ram_bytes         | 131745248 |
| disk_bytes        | 194452765 |
+-------------------+-----------+

Generate XML sitemap with all documents

http://www.sitemaps.org/protocol.html

robots.txt

Sitemap: http://www.example.com/sitemap-index.xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap1.xml.gz</loc>
      <lastmod>2004-10-01T18:23:17+00:00</lastmod>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/sitemap2.xml.gz</loc>
      <lastmod>2005-01-01</lastmod>
   </sitemap>
</sitemapindex>

You can provide multiple Sitemap files, but each Sitemap file that you provide must have no more than 50,000 URLs and must be no larger than 10MB (10,485,760 bytes)

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
</urlset> 
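
A sketch of generating these files with the Python standard library, splitting the URL list so that no file exceeds the 50,000-URL limit quoted above. The function names are hypothetical, the 10MB size limit is not checked, and how document URLs are obtained from the index is left out.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50000  # per-file limit from the sitemaps.org protocol


def chunk(urls, size=MAX_URLS_PER_FILE):
    """Yield consecutive slices of `urls` with at most `size` entries each."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_urlset(urls):
    """Serialize one sitemap file (<urlset>) listing the given URLs."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


def build_index(sitemap_urls):
    """Serialize the sitemap index (<sitemapindex>) pointing at each sitemap file."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        ET.SubElement(ET.SubElement(root, "sitemap"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


docs = ["http://wbc.macbre.net/document/5614/proces-w-marcu-1610-roku.html"]
print(build_urlset(docs))
```

`ET.tostring` omits the `<?xml ...?>` declaration when `encoding="unicode"` is used, so it would need to be prepended when writing the files to disk.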

Remove ASCII art

http://wbc.macbre.net/api/v1/documents/5978.txt

Restauracja czyściutka; obsługa nadzwyczaj miła, posiłek smaczny. "Chociaż raz bez peanut butter" (masło orzechowe, za którym Fiedlerowie przepadali i które z chlebem było nieodzowną częścią naszego 
,.....
.....,.
. \.

.'.
><h , '"  '. t" ra .',..'" ' .  'fi.. \ 11.;;' ",,,,,;\ :t:<.. ." , . I.. '."l'..''< > I":;"; ''''' ;'<' , . "':t-; u .... T f "; .'i' ,':, ., . II"; . ,: :r.:'" "'r:,',:f; .I F I . \,  <' 1" I ',.., i:J '.
ł ,. ."!', i J
;; 
.""
l :"?!. L>i,:._:'
"'
Ryc. 1. Arkady Fiedler z synami - Arkadym Radosławem i Markiem - w Kanadzie.
Fot. B. Sroka
obozowego jadłospisu), zauważył Arkady po pogawędce z panią Anną i komplemencie dla zwinnej kelnerki. Po obiedzie zaczęło się poszukiwanie indiańskich wyrobów dla Muzuem w Puszczykówku. Drzwi większości 
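
One possible heuristic, sketched below, drops lines in which letters make up too small a share of the visible characters. The 0.6 threshold and the function names are assumptions chosen for illustration. It has not been tuned against the archive, and a noise line that happens to contain enough letters would still pass.

```python
def looks_like_text(line, min_letter_ratio=0.6):
    """Return True when letters dominate the non-whitespace characters of a line."""
    visible = [ch for ch in line if not ch.isspace()]
    if not visible:
        return True  # keep blank lines as paragraph separators
    letters = sum(ch.isalpha() for ch in visible)
    return letters / len(visible) >= min_letter_ratio


def strip_ascii_art(text):
    """Remove lines that look like OCR noise from an image region."""
    return "\n".join(line for line in text.splitlines() if looks_like_text(line))


sample = "częścią naszego\n,.....\n. \\.\nFot. B. Sroka"
print(strip_ascii_art(sample))
```

`str.isalpha` is Unicode-aware, so Polish letters such as "ę" or "ł" count as letters.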

Use Bona Nova font by Andrzej Heidrich

http://bonanova.wtf/ + Fira Sans

@font-face {
  font-family: "BonaNova-Bold";
  src: url("../fonts/BonaNova/BonaNova-Bold.eot");
  src: url("../fonts/BonaNova/BonaNova-Bold.eot?#iefix") format("embedded-opentype"), url("../fonts/BonaNova/BonaNova-Bold.woff") format("woff"), url("../fonts/BonaNova/BonaNova-Bold.ttf") format("truetype");
  font-weight: normal;
  font-style: normal;
}
@font-face {
  font-family: "BonaNova-Italic";
  src: url("../fonts/BonaNova/BonaNova-Italic.eot");
  src: url("../fonts/BonaNova/BonaNova-Italic.eot?#iefix") format("embedded-opentype"), url("../fonts/BonaNova/BonaNova-Italic.woff") format("woff"), url("../fonts/BonaNova/BonaNova-Italic.ttf") format("truetype");
  font-weight: normal;
  font-style: normal;
}
@font-face {
  font-family: "BonaNova-Regular";
  src: url("../fonts/BonaNova/BonaNova-Regular.eot");
  src: url("../fonts/BonaNova/BonaNova-Regular.eot?#iefix") format("embedded-opentype"), url("../fonts/BonaNova/BonaNova-Regular.woff") format("woff"), url("../fonts/BonaNova/BonaNova-Regular.ttf") format("truetype");
  font-weight: normal;
  font-style: normal;
}

Set memory limit for sphinx in Docker

Aug 23 11:06:17 bjornoya kernel: [2125101.554554] Out of memory: Kill process 31364 (searchd) score 259 or sacrifice child
Aug 23 11:06:17 bjornoya kernel: [2125101.555535] Killed process 31364 (searchd) total-vm:361236kB, anon-rss:3548kB, file-rss:127356kB

Microdata - sharing

 <meta property="og:site_name" content="dzieje.pl" />
 <meta property="og:type" content="article" />
 <meta property="og:title" content="Radio w Poznaniu rozpoczęło nadawanie 90 lat temu" />
 <meta property="og:description" content="90 lat temu, 24 kwietnia 1927 roku nadawanie rozpoczęła poznańska stacja radiowa. Rozgłośnia regionalna Polskiego Radia, dawne Radjo Poznańskie, obecne Radio Merkury zmienia przy okazji jubileuszu swoja nazwę na Radio Poznań." />
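
Tags like these could be rendered from a document's metadata as sketched below. The function name `og_meta_tags` and the shape of the input dictionary are assumptions; values are HTML-escaped so that quotes in titles do not break the attribute.

```python
from html import escape


def og_meta_tags(properties):
    """Render Open Graph <meta> tags from a {name: value} mapping."""
    return "\n".join(
        '<meta property="og:{}" content="{}" />'.format(escape(name), escape(value))
        for name, value in properties.items()
    )


print(og_meta_tags({
    "site_name": "wbc.macbre.net",
    "type": "article",
    "title": 'Rząd "Narodowy" w 1848',
}))
```
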

Update to Sphinx v3.0.1

http://sphinxsearch.com/blog/2017/12/18/sphinx-3-0-1-released/ / http://sphinxsearch.com/files/sphinx-3.0.1-7fec4f6-linux-amd64.tar.gz

v2.3.2

Server version: 2.3.2-id64-beta (4409612) 

Copyright (c) 2000, 2017, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show index wbc status;
+-------------------+--------------------------------------------------------------------------------------------------------+
| Variable_name     | Value                                                                                                  |
+-------------------+--------------------------------------------------------------------------------------------------------+
| index_type        | disk                                                                                                   |
| indexed_documents | 6968                                                                                                   |
| indexed_bytes     | 123948611                                                                                              |
| ram_bytes         | 156329300                                                                                              |
| disk_bytes        | 220212818                                                                                              |
| query_time_1min   | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                               |
| query_time_5min   | {"queries":8, "avg_sec":0.072, "min_sec":0.000, "max_sec":0.443, "pct95_sec":0.221, "pct99_sec":0.221} |
| query_time_15min  | {"queries":8, "avg_sec":0.072, "min_sec":0.000, "max_sec":0.443, "pct95_sec":0.221, "pct99_sec":0.221} |
| query_time_total  | {"queries":8, "avg_sec":0.072, "min_sec":0.000, "max_sec":0.443, "pct95_sec":0.443, "pct99_sec":0.443} |
| found_rows_1min   | {"queries":0, "avg":"-", "min":"-", "max":"-", "pct95":"-", "pct99":"-"}                               |
| found_rows_5min   | {"queries":8, "avg":6, "min":0, "max":24, "pct95":19, "pct99":19}                                      |
| found_rows_15min  | {"queries":8, "avg":6, "min":0, "max":24, "pct95":19, "pct99":19}                                      |
| found_rows_total  | {"queries":8, "avg":6, "min":0, "max":24, "pct95":24, "pct99":24}                                      |
+-------------------+--------------------------------------------------------------------------------------------------------+
13 rows in set (0.00 sec)
