dreamproit / bill-similarity

Calculate similarity of bill documents using a variety of NLP approaches

Python 100.00%
bills legislation nlp similarity similarity-measures us-congress

bill-similarity's People

Contributors

aih, alexbojko, dmytro-ustynov

bill-similarity's Issues

Estimate effort to run automated similarity calculations and save to API database

This issue is to estimate the task to:

  1. Process a bill with a similarity algorithm to return a list of similar bills in the form of a BillToBill model (e.g. on the investigate_simhashes branch: https://github.com/dreamproit/bill-similarity/pull/4/files)
  2. Save the similar bills to the BillToBill table of billtitles-py (https://github.com/dreamproit/billtitles-py/blob/main/billtitles/models.py#L72). This uses the helper function, create_billtobill to save the billtobill data:
    https://github.com/dreamproit/billtitles-py/blob/main/billtitles/crud.py#L123
  3. Create a pipeline to do this:
    a. Once for all bills
    b. Each time bills are updated by the uscongress bill scraper

Related to dreamproit/BillMap#13

This is the equivalent of the pipeline that is already working to populate the database for BillMap, described here:
https://github.com/dreamproit/bill-similarity/blob/investigate_simhashes/docs/SQL_APPROACH.adoc#current-data-pipeline-and-storage
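
The three steps in this issue could be sketched roughly as follows. This is only an outline under stated assumptions: `BillToBillRecord` is a hypothetical stand-in for the real BillToBill model, its field names are guesses, and `find_similar` and `save` are placeholder callables for the similarity algorithm and for a helper such as create_billtobill, whose actual signature is not shown here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class BillToBillRecord:
    """Hypothetical stand-in for the BillToBill model in billtitles-py."""
    billnumber: str      # bill being processed (assumed field name)
    billnumber_to: str   # bill it is similar to (assumed field name)
    score: float         # similarity score from the algorithm


def process_bill(
    billnumber: str,
    find_similar: Callable[[str], Iterable[Tuple[str, float]]],
) -> List[BillToBillRecord]:
    """Step 1: run a similarity algorithm and wrap the results as records."""
    return [
        BillToBillRecord(billnumber, other, score)
        for other, score in find_similar(billnumber)
        if other != billnumber  # skip the trivial self-match
    ]


def run_pipeline(
    billnumbers: Iterable[str],
    find_similar: Callable[[str], Iterable[Tuple[str, float]]],
    save: Callable[[BillToBillRecord], None],
) -> int:
    """Steps 2 and 3: save the records for every bill in the batch.

    Pass all bill numbers for the initial run (3a), or only the bills
    reported as updated by the scraper for an incremental run (3b).
    """
    saved = 0
    for billnumber in billnumbers:
        for record in process_bill(billnumber, find_similar):
            save(record)  # placeholder for the real database helper
            saved += 1
    return saved


# Usage with a fake similarity function and an in-memory "table".
def fake_similarity(billnumber: str):
    return [("116hr200ih", 0.91), (billnumber, 1.0)]


store: List[BillToBillRecord] = []
count = run_pipeline(["116hr100ih"], fake_similarity, store.append)
```

Keeping the similarity function and the save function as parameters means the same loop can serve both the one-off run over all bills and the incremental run after each scrape.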

Get more bills

The repository initially included several bill files in XML format, such as samples/congress/116/uslm/BILLS-116hconres9enr.xml, along with a parsing script that extracts the sections from each bill.
However, this script does not seem to work with the other bills in the set we download with the congress tool.
So the main question is:
How (and where) can I get more bills, preferably the whole set, that I can split into sections for further work?
Maybe (this is just a suggestion) we should adapt the parsing script so that it can parse that set?
Or is there a transformation step that I haven't found yet?

Anyway, the main point is to get more bills in order to get more sections from them.
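
One way a parsing script could be made less sensitive to differences between bill formats is to match elements by local name and ignore the XML namespace. The sketch below is not the repository's parsing script; it assumes that sections appear as elements whose local name is `section`, and the sample document and its namespace are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List


def local_name(tag: str) -> str:
    """Strip an XML namespace: '{http://...}section' becomes 'section'."""
    return tag.rsplit("}", 1)[-1]


def extract_sections(xml_text: str) -> List[str]:
    """Return the whitespace-normalised text of each <section> element."""
    root = ET.fromstring(xml_text)
    sections = []
    for element in root.iter():
        if local_name(element.tag) == "section":
            # Join all nested text, then collapse runs of whitespace.
            text = " ".join("".join(element.itertext()).split())
            if text:
                sections.append(text)
    return sections


# Hypothetical sample; real bill files are larger and may nest sections.
sample = """<bill xmlns="http://example.org/hypothetical">
  <section><num>1.</num> <heading>Short title</heading></section>
  <section><num>2.</num> <heading>Findings</heading></section>
</bill>"""

sections = extract_sections(sample)
```

For files on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`. Note that if sections are nested, the text of an inner section also appears inside its parent's text, so a real script would need to decide which level to keep.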

Example issue

@dmytro-ustynov please pay attention to this section when you file an issue:

[Screenshot 2022-04-06 at 23 13 39]

While you are creating an issue you can link it to the GitHub Project.

Then, once the issue has been created, you can add it to the sprint, estimate it, and change its status.

INIT ISSUE. Investigating project

This is a reminder for me (TODO).
Just a starting issue for project onboarding.
Some topics to explain and investigate:

  • project structure, environment requirements
  • parsing tools
  • bill structure and what parts it consists of

Related topics that are also helpful to understand in the context of searching for similarities:

  • Simhashing and minhashing
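
As a starting point for that investigation, here is a minimal simhash sketch. It is a generic textbook version, not the implementation used in this repository: the word-level tokenisation, the SHA-256 token hash, and the 64-bit fingerprint size are all assumptions chosen for simplicity.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a simple simhash fingerprint over lowercase word tokens."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (value >> i) & 1 else -1
    # Bit i of the fingerprint is set when the votes for 1 outnumber those for 0.
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


first = simhash("This Act may be cited as the Example Act of 2022.")
second = simhash("This Act may be cited as the Example Act of 2021.")
distance = hamming_distance(first, second)
```

Texts that share most of their tokens tend to produce fingerprints with a small Hamming distance, which is what makes simhash useful for near-duplicate detection. Minhash instead estimates the Jaccard similarity of token sets and is not shown here.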

Further README documentation of simhash process

The current simhash README explains how to set up the system (good) and how to run all bills (good). What is missing is how to:

  • run a single bill (by bill name or by bill path?)
  • run a batch of bills
  • determine the form of the output of the functions: is it a list, a dictionary, a JSON object? Which fields are output, and how do we associate the results with each bill?
  • use the output to populate the billsim database

Also, we should combine this information with details of how the simhash + SQL query (by @alexbojko) works and what the differences are from pure simhash.
