skshub-data's Issues

Link organizations and activities via an organization's legal name

As a user of this platform, I want to know which organization conducted which activities (and vice versa) so that I have additional context about the activity or organization.

Currently, organizations and activities are only linked via their Business Number (BN). Because relatively few records in the data have a BN, also linking organizations and activities via their legal name would increase the number of connected results.
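
A minimal sketch of how such a legal-name link could work, assuming the data is handled as CSVs in pandas; the file names and column names (legal_name, recipient_legal_name, sks_hub_id) are placeholders, not the project's actual schema:

```python
# Hedged sketch: link activities to organizations on a normalized legal name.
# File names and column names are assumptions, not the hub's real schema.
import pandas as pd

def normalize_name(name: str) -> str:
    """Lowercase, trim, and collapse whitespace so near-identical names match."""
    return " ".join(str(name).lower().split())

orgs = pd.read_csv("entities.csv")
acts = pd.read_csv("activities.csv")

orgs["name_key"] = orgs["legal_name"].map(normalize_name)
acts["name_key"] = acts["recipient_legal_name"].map(normalize_name)

# Keep the Business Number link where it exists; fall back to the legal-name key.
linked = acts.merge(orgs[["name_key", "sks_hub_id"]], on="name_key", how="left")
print(f"{linked['sks_hub_id'].notna().sum()} of {len(linked)} activities matched")
```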

Overhaul and recreate data cleaning process for activities & entities

As a developer of the data, I would like a robust data cleaning process that ensures all the data is clean when uploaded, and that no records go missing due to insufficient data cleaning efforts.

More details:
There are up to 200k activity records missing from the hub: they did not upload to Postgres correctly and were therefore left out of the search engine (to avoid pages redirecting to nowhere). A robust data cleaning effort is needed so that those 200k records upload correctly, ideally in bulk from a single CSV rather than the current line-by-line process, which is particularly slow on the DigitalOcean-hosted database.

Deliverable: A fully cleaned & uploadable CSV of all 565k activities
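
A minimal sketch of the bulk-load step, assuming the cleaned CSV is loaded into Postgres with COPY via psycopg2; the table name, column list, connection string, and file name are all assumptions:

```python
# Hedged sketch: bulk-load the cleaned activities CSV with Postgres COPY
# instead of row-by-row INSERTs. Table/column names and the connection
# string are assumptions, not the hub's actual schema.
import psycopg2

COPY_SQL = """
    COPY activities (activity_id, legal_name, business_number, description)
    FROM STDIN WITH (FORMAT csv, HEADER true)
"""

with psycopg2.connect("postgresql://user:password@localhost:5432/skshub") as conn:
    with conn.cursor() as cur, open("activities_clean.csv", encoding="utf-8") as f:
        cur.copy_expert(COPY_SQL, f)
    conn.commit()
```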

Another minor consideration related to data cleaning:
There is a particular case where the 'all programs' data point outputs "Charity provided description when other program areas are not applicable"; this output should be changed to "Not Available" (unless we can find this description somewhere else?).
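
A minimal sketch of that replacement as one step in the cleaning script, assuming pandas; the column name "all_programs" is an assumption:

```python
# Hedged sketch: swap the placeholder text for "Not Available" during cleaning.
# The column name "all_programs" is an assumption about the activities CSV.
import pandas as pd

PLACEHOLDER = "Charity provided description when other program areas are not applicable"

acts = pd.read_csv("activities.csv")
acts["all_programs"] = acts["all_programs"].replace(PLACEHOLDER, "Not Available")
acts.to_csv("activities_clean.csv", index=False)
```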

Add recipient type field to activities dataset

As a user of the platform, I want to know the type of recipient receiving each grant, as it allows me to better understand the kinds of activities being conducted.

The Grants and Contributions data contains a "recipient type" field that identifies whether the grant recipient is: Aboriginal recipients, for-profit organizations, government, international (non-government), non-profit organizations and charities, individuals, or academia. This information could be useful for certain use cases and could help narrow down results. The field would be added to the activities dataset.
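
A minimal sketch of carrying that field over, assuming both datasets are CSVs that share a reference-number column; the join key ("ref_number") and column names are assumptions about both files:

```python
# Hedged sketch: copy the recipient type from the Grants and Contributions
# extract into the activities dataset. "ref_number" and the column names are
# assumptions, not the real file layout.
import pandas as pd

acts = pd.read_csv("activities.csv")
gc = pd.read_csv(
    "grants_and_contributions.csv", usecols=["ref_number", "recipient_type"]
)

acts = acts.merge(gc, on="ref_number", how="left")
acts["recipient_type"] = acts["recipient_type"].fillna("Not Available")
acts.to_csv("activities_with_recipient_type.csv", index=False)
```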

Add linkages (website text) as a 3rd CSV in the GitHub data repository

As a user of the data repository for the SKS project, I would like to access the website text information to read it and conduct further analysis.

The CSV should contain these fields: Organization legal name, Business number, website URL, website text and SKS hub ID
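
A minimal sketch of writing that CSV, assuming the scraped records are available as dictionaries; the exact column-header spellings are placeholders:

```python
# Hedged sketch: write the linkages CSV with the five fields listed above.
# Header spellings and the shape of the scraped records are assumptions.
import csv

FIELDS = [
    "organization_legal_name",
    "business_number",
    "website_url",
    "website_text",
    "sks_hub_id",
]

def write_linkages_csv(records, path="linkages.csv"):
    """records: an iterable of dicts keyed by FIELDS."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)
```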

Develop the linkages data model

As a developer of the SKS project, I would like to have a clear and identified Linkages data model in order to continue my work on the Linkages CSV and integrate this data into the interface.

More work is required to develop and implement the linkages data model. Currently the only finished piece is the web scraper, and it is not yet integrated in the same way as activities and entities, so it would be best to start a "process_linkages.py" script (or similar) that operates like the scripts for the other data types.

Here is the work so far on the Linkages data model

Deliverable: Scripts that integrate the Linkages data model in a similar way to activities (process_activities.py) and entities (process_entities.py)
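
A hypothetical skeleton for such a script; the real process_activities.py and process_entities.py may be structured differently, and the function names, column names, and cleaning steps shown here are assumptions only:

```python
# process_linkages.py -- hypothetical skeleton only. The real activities/entities
# scripts may differ; column names and cleaning steps here are assumptions.
import pandas as pd

def load_raw_linkages(path: str) -> pd.DataFrame:
    """Read the scraped website-text records."""
    return pd.read_csv(path)

def clean_linkages(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows missing a URL or legal name and trim the scraped text."""
    df = df.dropna(subset=["website_url", "organization_legal_name"])
    return df.assign(website_text=df["website_text"].str.strip())

def main() -> None:
    df = clean_linkages(load_raw_linkages("linkages_raw.csv"))
    df.to_csv("linkages.csv", index=False)

if __name__ == "__main__":
    main()
```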

Create a scraper for documents and websites

As a developer of the SKS project, I would like a scraper that searches for and finds documents or website URLs, to help populate the linkages and documents datasets.

This could be one multipurpose scraper or two separate scrapers:

  • We need to find more organizations' websites and then scrape their website text (this could also serve as a website identifier: take a list of organization names and find each one's website); a sketch of the scraping step follows this list.
  • We need to potentially find documents
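
A minimal sketch of the website-text half, assuming requests and beautifulsoup4; finding a URL from an organization name (e.g. via a search API) is left out, and the User-Agent string is a placeholder:

```python
# Hedged sketch: fetch a known organization URL and pull out its visible text.
import requests
from bs4 import BeautifulSoup

def scrape_website_text(url: str, timeout: int = 15) -> str:
    resp = requests.get(url, timeout=timeout, headers={"User-Agent": "sks-hub-scraper"})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Drop non-content tags, then collapse whitespace in what remains.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

if __name__ == "__main__":
    print(scrape_website_text("https://example.org")[:500])
```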

Explore finding and adding documents

As a user of the platform, I would like documents that describe the activities being conducted in more depth, or that give more information about the organization, to supplement the information in the data.

To go forward, a web scraper will be needed to search for and download these documents, then store them somewhere like an S3 bucket so they can be accessed from the hub. More planning is required before this work is started.
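
A minimal sketch of the storage step, assuming boto3 and an S3 (or S3-compatible) bucket; the bucket name and key layout are placeholders:

```python
# Hedged sketch: push a downloaded document into an S3 bucket keyed by the
# organization's SKS hub ID. Bucket name and key layout are assumptions.
import os
import boto3

def upload_document(local_path: str, sks_hub_id: str, bucket: str = "sks-hub-documents") -> str:
    s3 = boto3.client("s3")
    key = f"documents/{sks_hub_id}/{os.path.basename(local_path)}"
    s3.upload_file(local_path, bucket, key)
    return key  # store this key alongside the record so the hub can link to it
```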

Here is the work done so far on incorporating documents into the data model
