Lucidworks Search Hub

Search Hub is an application built on top of Lucidworks Fusion.
It is designed to showcase Fusion's search, machine learning, and analytics capabilities, as well as to act as a community service for a large number of Apache Software Foundation (ASF) projects. It is the basis of several talks by Lucidworks employees (e.g. http://www.slideshare.net/lucidworks/data-science-with-solr-and-spark). A production version of this software, hosted by Lucidworks, is available at http://searchhub.lucidworks.com.

Search Hub contains all you need to download and run your own community search site. It comes with prebuilt definitions to crawl a large number of ASF projects, including their mailing lists, websites, wikis, JIRAs, and Github repositories. These prebuilt definitions may also serve as templates for adding additional projects. The project also comes with a built-in client (based on Lucidworks View).

This application uses Snowplow for tracking on the website. In particular, it tracks:

  1. Page visits
  2. Time on page (via page pings)
  3. Location
  4. Clicks on documents and facets
  5. Searches

Search Hub is open source under the Apache License, although note that Lucidworks Fusion itself is not open source.

Requirements

You'll need the following software installed to get started.

  • Node.js 5.x: Use the installer for your OS, e.g. brew install homebrew/versions/node5

  • Git: Use the installer for your OS.

  • virtualenv: Use the installer for your OS

  • Gulp and Bower: npm install -g gulp bower. Depending on how Node is configured on your machine, you may need to run sudo npm install -g gulp bower instead if you get an error with the first command.

  • Python 2.7 and python-dev

  • Fusion 3.1. To use Fusion 3.0.x instead, check out the tag 3_0_cutover and download Fusion 3.0.x from the Lucidworks website; to use Fusion 2.4.x, check out the tag pre_3_0_cutover and download Fusion 2.4.x from the Lucidworks website.

  • If you want to crawl the Github sources, you'll need a Github API key: https://github.com/blog/1509-personal-api-tokens

  • If you want to crawl Twitter, you will need Twitter keys: https://dev.twitter.com/oauth/overview

Get Started

In ~/.gradle/gradle.properties, add/set:

searchhubFusionHome=/PATH/TO/FUSION/INSTALL

The searchhubFusionHome variable is used by the build to know where to deploy the custom plugins that the Search Hub project needs (namely, a Mail Parsing Stage).

If you haven't already, clone this repository and change into the directory of the clone.

git clone https://github.com/LucidWorks/searchhub
cd searchhub

Run the installer to install the NPM, Bower, and Python dependencies:

./gradlew install

(Re)start your Fusion instance (see Requirements above; the Fusion version must match the tag you checked out). This is important because deployLibs (a task called by the install task) installs the MBoxParsingStage into Fusion.

Build the UI: This will copy the client files into python/server. NOTE: This is deprecated.

./gradlew buildUI

If you prefer using Gulp, you can also run gulp build

Setup Python Flask:

source venv/bin/activate
cd python
cp sample-config.py config.py
#fill in config.py as appropriate. You will need Twitter keys to make Twitter work.  You will need a Github key to make Github work.
../venv/bin/python bootstrap.py

NOTE: Before you can successfully run the bootstrap, you must create a lucidfind user in the Fusion admin panel. The bootstrap.py step creates a number of objects in Fusion, including collections, pipelines, schedules, and datasources. By default, the bootstrap script does not start the crawler, nor does it enable the schedules. If you wish to start them, visit the Fusion Admin UI or do one of the following:

To run the datasources once, upon creation (note: this can be quite expensive, as it will start all datasources):

cd python
../venv/bin/python bootstrap.py --start_datasources

To enable the schedules, edit your config.py and set ENABLE_SCHEDULES=True, then rerun python bootstrap.py. A sketch of a filled-in config follows.
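
For reference, here is a hedged sketch of what python/config.py might contain. Only ENABLE_SCHEDULES is taken from this README; the other setting names are illustrative placeholders, so consult sample-config.py for the real ones.

# Hypothetical sketch of python/config.py; see sample-config.py for the
# actual setting names. Only ENABLE_SCHEDULES is documented in this README.
FUSION_URL = "http://localhost:8764/api"    # placeholder name
FUSION_USERNAME = "lucidfind"               # placeholder name
FUSION_PASSWORD = "CHANGE_ME"               # placeholder name

ENABLE_SCHEDULES = True   # schedules are disabled by default

# Keys for the optional Twitter and Github crawls (placeholder names)
TWITTER_CONSUMER_KEY = "..."
TWITTER_CONSUMER_SECRET = "..."
GITHUB_KEY = "..."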

Running Search Hub

Local, Non-Production Mode using Werkzeug

Run Flask (from the python directory):

cd python
../venv/bin/python run.py

Browse to http://localhost:5000

If you make changes to the UI, you will need to either rebuild the UI (npm build) or run:

npm watch

Production

Docker

The easiest way to spin up the Search Hub client and Python app is by using Docker and the Dockerfile in the python directory.

This container is built on httpd and mod_wsgi.

To build a container, do the following steps:

  1. Edit your FUSION_CONFIG.js to point to the IP of your container. You can also do this afterwards by attaching to the running container and editing it.
  2. Build the SearchHub UI (see above) so that the Client assets are properly installed in the Python server directory
  3. cd python
  4. Create a config-docker.py file that contains the configuration required to connect to your Fusion instance. Note that this Docker container does not run Fusion.
  5. docker build -t searchhub . -- This builds the Docker container
  6. docker run -it --rm -p 8000:80 --name searchhub searchhub -- This runs the container and maps container port 80 to host port 8000. See the Docker help for other ways to run Docker containers.
  7. Point your browser at http://host:8000/ where host is the IP for your Docker container.

Some other helpful commands:

  1. docker rmi -f searchhub -- delete a previously built version of the container

WSGI Compliant Server

See docker.sh in the Home directory for how to build and run mod_wsgi_express in a Docker container.

Scaling

Lucidworks' production instance is built using the Solr Scale Toolkit -- aka SSTK -- with a public/private VPC setup.
The public-facing Docker application (i.e. the Client Application below) sits in a public subnet with port 80 exposed. Everything else is in a private subnet, and the public subnet can only reach the private subnet via port 8764.

The commands used to deploy Fusion using SSTK are as follows:

  1. fab new_ec2_instances:shub,n=3,instance_type=r4.2xlarge,az=us-east-1e,purpose='Test r4 instance types',vpcSubnetId='subnet-XXXXXXX',vpcSecurityGroupId='sg-XXXXXXX'
  2. fab attach_ebs:shub,size=800,volume_type=gp2
  3. fab setup_solrcloud:shub,zkn=3
  4. fab upload_fusion_plugin_jars:shub,jars='/home/MY_USER/searchhub-fusion-plugins-0.1.jar' -- note, you need this file locally on the machine you are running SSTK on
  5. fab fusion_start:shub,ui=3

Do note that because of the private subnet, the machine you are running SSTK on needs access to that subnet, so we typically use a proxy node that is locked down and has all of our tools installed on it.

The Client Application

The Client Application is an extension of Lucidworks View and thus relies on similar build and layout mechanisms and structures. It is an Angular app and leverages FoundationJS. We have extended it to use the Snowplow JavaScript Tracker for capturing user interactions. All of these interactions are fed through the Flask middle tier and then on to Fusion for use by our clickstream and machine learning capabilities.

Configuration

To configure the client application, change the settings in FUSION_CONFIG.js. See the View docs for more details, or read the comments in the config file itself.

Extending

Pull Requests are welcome for new projects, new configurations and other new extensions.

Project Layout

The Search Hub project consists of 3 main development areas, plus build infrastructure:

Client

Written in JavaScript using AngularJS and Foundation, the Client is located in the client directory. Its build is a bit different than most JS builds: it copies Lucidworks View from the node_modules download area into a temporary build directory, copies the Search Hub client code into that same directory, and then builds the result and moves it to the Flask application serving area (python/server). We are working on ways to improve how View is extended, so this approach, while viable for now, may change. Our goal is to have most of the Client UI be driven by View itself, with very little extension in Search Hub.

Python

The python directory contains the Flask application, which acts as the middle tier between the client and Fusion. Most of the work in the application is initiated by either the bootstrap.py file or the run.py file. The former is responsible for using the configurations in python/fusion_config and python/project_config to, as the name implies, bootstrap Fusion with datasources, pipeline definitions, schedules, and whatever else is needed to make sure Fusion has the appropriate data necessary to function. The latter is the Flask app that serves the application; it consists primarily of routing information and a thin proxy to Fusion, sketched below.
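
As a rough illustration of that thin proxy, here is a minimal Flask sketch; it is not the actual run.py, and the pipeline name, collection, and credentials are placeholders:

# Minimal sketch of a thin Flask proxy to Fusion; not the actual run.py.
# The pipeline name, collection, and credentials are placeholders.
import requests
from flask import Flask, request, jsonify

app = Flask(__name__)
FUSION_QUERY_URL = ("http://localhost:8764/api/apollo/query-pipelines/"
                    "lucidfind-default/collections/lucidfind/select")

@app.route("/api/query")
def query():
    # Forward the caller's query parameters to Fusion and relay the response
    params = request.args.to_dict()
    params["wt"] = "json"
    resp = requests.get(FUSION_QUERY_URL, params=params,
                        auth=("lucidfind", "CHANGE_ME"))
    return jsonify(resp.json()), resp.status_code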

Most of the Python work is defined in the python/server directory. This directory and its children define how Flask talks to Fusion, along with some template helpers for creating various datasources in Fusion. A good starting place for learning more is the fusion.py file in python/server/backends.

Fusion Plugins

The searchhub-fusion-plugins directory contains Java and Scala code for extending and/or utilizing Fusion's backend capabilities. On the Java side, the two main functions are:

  1. A Mail Parsing Stage that is responsible for extracting pertinent information out of mail messages (e.g. thread ids, to/from)
  2. A mail downloader, sketched below. Since we don't want to tax Apache Software Foundation resources directly when crawling (they have a banning mechanism), we have set up an httpd mod_mbox mirror.
    The mail downloader is responsible for retrieving the daily mbox messages. If you wish to have a local mirror for your own purposes, you can use this class to get your own mbox files.
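
The downloader itself is Java, but the idea is easy to sketch in Python. The mirror host below is the one mentioned later in this README; the URL path pattern follows the usual mod_mbox layout and is an assumption, not the project's actual code:

# Python sketch of fetching one month of a list's mbox archive from a
# mod_mbox mirror. The URL pattern is an assumption based on mod_mbox's
# usual layout, not the project's actual Java downloader.
import requests

def fetch_mbox(list_name, year, month,
               mirror="http://asfmail.lucidworks.io/mail"):
    url = "%s/%s/%04d%02d.mbox" % (mirror, list_name, year, month)
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    return resp.content

# Example: save the July 2016 archive of a (hypothetical) list name
with open("lucene-dev-201607.mbox", "wb") as f:
    f.write(fetch_mbox("lucene-dev", 2016, 7))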

On the Scala side, there are a number of Spark Scala utilities that show how to leverage Lucene analysis in Spark and run common Spark ML tasks like LDA and k-means, plus some code for correlating email messages based on message ids. See Grant Ingersoll's talk at the Dallas Data Science meetup for details. To learn more about the Scala side, start with the SparkShellHelpers.scala file.

The Build

The build is primarily driven by Gradle and Gulp. Gradle defines, per the Get Started section above, all of the tasks needed to run Search Hub.
However, on the client side of things, it simply invokes npm or Gulp to do the JavaScript build. To learn more about the build, see build.gradle.

Adding your own Project to Crawl

To add another project, you need to do a few things:

  1. In $FUSION_HOME/python/project_config, create/copy/edit a project configuration file. See accumulo.json as an example.
  2. In $FUSION_HOME/searchhub-fusion-plugins/src/main/resources, edit the mailing_lists.csv to add your project.
  3. If you are adding more mailing lists, you will need to either crawl the ASF's mail archives site (please be polite when doing so) or set up an httpd mod_mbox instance like we have at http://asfmail.lucidworks.io. If you submit a pull request against this project with your mailing_lists.csv changes, we will consider adding it to our hosted version.

searchhub's Issues

Add banner support

Add support for banners/landing pages to promote specific results or download links and other editorial content.

Recommender

Add a backend recommender module to searchhub, starting with a moreLikeThis based on content as well as co-authorship on a thread.
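
The content half could be prototyped with Solr's MoreLikeThis component. A hedged Python sketch follows; the Solr core URL and similarity field are illustrative placeholders:

# Sketch of a content-based moreLikeThis lookup against Solr; the core
# URL and similarity field ("body") are illustrative placeholders.
import requests

def more_like_this(doc_id, solr_url="http://localhost:8983/solr/lucidfind"):
    params = {
        "q": "id:%s" % doc_id,
        "mlt": "true",      # enable the MoreLikeThis search component
        "mlt.fl": "body",   # field to compute similarity on
        "mlt.count": 5,
        "wt": "json",
    }
    return requests.get(solr_url + "/select", params=params).json()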

Enable continuous deployment of UI

Switch our development strategy a bit such that we can always deploy off of master.

To do:

  1. Add in scripts for building/deploying off of master continuously
  2. Document procedures for branching and merging to master
  3. Hook in appropriate monitoring and testing

Change publishedOnDate to be a "tdate" field with precisionStep = 6

Currently, publishedOnDate is a "date" field type, which has a precisionStep of 0. Since we are primarily using this field for range faceting with a month gap, we should raise the precisionStep by switching to the tdate field type.

To do this in production, we likely need to actually do:

  1. set up a new field, publishedOnDate_facet, of type tdate, and copyField the original into it
  2. change the UI/queries to use this new copy field
  3. clear the crawl db (don't delete the docs)
  4. recrawl

Add taxonomy/rules support

Depending on where the user is coming from, we should have different taxonomy/rules to boost content. For instance, if coming from the ASF, then boost mailing lists, if coming from Lucid, boost Lucid content.

Project name query parser

It would be nice to do like JIRA does and detect when a user just wants to search within a particular project by adding query parsing to support this.

For instance "fusion html transform stage" would select "fusion" as a project filter and then execute the rest of the query with that filter applied.

Mail.java getText() improvements

In Mail.java, the getText(Part p) method prefers HTML bodies over plain text, which breaks display and searching. We should make this more flexible by returning the text for all the parts, so that further upstream we can make a more informed choice. Eventually, we should just feed all these parts into the pipeline and let the pipeline decide.
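
The gist of the proposed behavior, sketched with Python's standard email module rather than the project's Java code: return the text of every part and let the caller choose.

# Sketch of the proposed fix using Python's stdlib email module (the real
# change would be in Mail.java): collect all text parts instead of letting
# HTML win, so upstream code can make an informed choice.
from email import message_from_string

def get_texts(raw_message):
    msg = message_from_string(raw_message)
    parts = {"text/plain": [], "text/html": []}
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype in parts:
            parts[ctype].append(part.get_payload(decode=True))
    return parts  # caller can prefer text/plain for display and search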

Username in Requests

When a user is logged in (See #30), we should pass in the username to all requests (searches, etc.), not just the signals.

Calculate "lifetime value" of contributors

It would be interesting to calculate the "lifetime value" of contributors to a project. We should be able to use a variety of metrics from each project to determine such a factor.

Using filters generates "There are no results" message (with HTTP 400 error in JSON)

Add filtering based on Solr _version_ field to mail threading jobs

When continually crawling and indexing into Solr, as we'll do in production for searchhub, we need to make sure the threading batch job can read from a consistent point in Solr each time, by adding a .filter(s"version < $maxCurrentVersion") to the DataFrame of messages.

Add sorting

Would be good to offer sorting, especially by date.

Mail Threading Jobs

Currently, the mail threading jobs do a *:* (match-all) query on Solr, but this retrieves all documents. We should add a "type" field that identifies the type of content (mail, github, website, et al.) and then use that as a filter.

Write Fusion Spark Job to Train and Apply LDA model to Mail Index

Train on an equally weighted downsample of the project messages, then apply that model to the full index, indexing the top 3 topics for each message, and set this field as a facetable field.

Then generate the top K terms for each topic by looking for those with the highest LLR w.r.t. the background, and add some of these terms (possibly e.g. 10 terms, weighted by which topics have the highest weight for this doc) into yet another field.
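
A hedged PySpark sketch of the LDA portion (the real job would be a Fusion Spark job, likely in Scala; the spark-solr read, collection name, and column names are assumptions):

# PySpark sketch of the LDA part of this job. The spark-solr read, the
# collection name, and the "body" column are assumptions; the real job
# would run as a Fusion Spark job.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("lda-sketch").getOrCreate()
mail_df = (spark.read.format("solr")   # requires the spark-solr package
           .options(zkhost="localhost:9983", collection="lucidfind")
           .load())

words = Tokenizer(inputCol="body", outputCol="words").transform(mail_df)
cv_model = CountVectorizer(inputCol="words", outputCol="features",
                           vocabSize=50000).fit(words)
vectors = cv_model.transform(words)

lda_model = LDA(k=20, maxIter=50).fit(vectors)  # train on the downsample
topics = lda_model.transform(vectors)  # adds a "topicDistribution" column
# Take the top 3 topics per message from "topicDistribution" and index them
# into a facetable field; lda_model.describeTopics() gives per-topic terms.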

Add support for notifications and system messages

It would be useful to be able to provide info about the system as notifications (downtime, new features, etc.)

To do this, we should just have a "messages" collection that we can store to and load from, and then hook into the display.

We will also need to add support in the UI for displaying the messages.

Add JIRA Crawling

Since the Fusion JIRA connector can only crawl an entire JIRA site, we need a JIRA pipeline that drops documents from projects that we are not interested in.

Per Document View

It would be nice to show per-document views of results, i.e. when a user clicks on a specific result we take them to a details page.

Write Fusion spark job to do classification (20 newsgroups style) to the mailing list data

We'd like to showcase training and running classification of content as it flows into the system by writing a classifier that predicts which mailing list a message best belongs to and then writing that prediction onto the document itself. We should then show that prediction as a facet in the UI. It would also be good to report statistics in the UI on how our classifier is performing.
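
Sketched in PySpark under stated assumptions (the "list" and "body" columns and the labeled DataFrames are hypothetical; the real job would be a Fusion Spark job):

# PySpark sketch of training the mailing-list classifier; the "list" and
# "body" columns and the train/test DataFrames are assumptions.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, Tokenizer, HashingTF
from pyspark.ml.classification import NaiveBayes

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="list", outputCol="label"),  # mailing list = class
    Tokenizer(inputCol="body", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    NaiveBayes(),  # term counts are nonnegative, so Naive Bayes applies
])
model = pipeline.fit(train_df)           # train_df: labeled mail DataFrame
predictions = model.transform(test_df)   # adds a "prediction" column per doc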

Fix document template views in Search Results

We need to clean up the document template views so that they don't show so much text.

They should do either:

  1. Show the highlight snippets
  2. Show some reasonable snippet of text (like a couple of sentences)

For mailing list items, we should trim off headers, signatures, and the like.

Add signal spam detection to the Flask Proxy

In order to prevent signal spamming, we should add some spam detection to the proxy.

For starters, perhaps we could watch for a high volume of clicks or other signals from the same IP and/or user within some configurable delta of time, say 10 seconds.

This should go in the views.py "snowplow" route.
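
A minimal in-memory sketch of that check (a real deployment with multiple workers would need shared state such as Redis; the threshold is illustrative):

# Minimal in-memory sketch of the proposed per-IP signal rate check; a real
# deployment would need shared state (e.g. Redis) across workers.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # the configurable time delta from this issue
MAX_SIGNALS = 20      # illustrative threshold

_recent = defaultdict(deque)

def is_spam(ip):
    now = time.time()
    hits = _recent[ip]
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()  # drop signals outside the window
    hits.append(now)
    return len(hits) > MAX_SIGNALS

# In the views.py "snowplow" route, drop or flag the signal when
# is_spam(request.remote_addr) returns True.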

Turn on multi-select faceting

Try doing a search, and then select more than one date facet value. You'll get no results, because it looks like the filters for facets in the same group are AND'ed together. See Solr's documentation for details.
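
The standard Solr remedy is to tag each filter and exclude that tag when computing the corresponding facet, so a selected value doesn't filter its own facet counts. A sketch of the request parameters from Python (the use of publishedOnDate here is illustrative):

# Solr multi-select faceting via tag/ex local params; using publishedOnDate
# for illustration.
params = {
    "q": "fusion",
    # Tag the date filter...
    "fq": "{!tag=df}publishedOnDate:[NOW-1MONTH TO NOW]",
    # ...and exclude that tag when computing the date facet, so the other
    # date values stay selectable (OR semantics within the facet group).
    "facet": "true",
    "facet.field": "{!ex=df}publishedOnDate",
}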

Move signals to use a non-admin user

Change the middle tier to send signals using a non-admin user. We should create a new user (or use the lucidfind user) for signals and give them permission to post to the signals index and that is it.

UI doesn't show properly in Safari

When viewing the site in Safari, the search box gets truncated and doesn't display properly. It works properly in Firefox and Chrome.
