
avni-infra's Issues

load is high on Read DB used by Superset reports

Freshdesk ticket: https://avni.freshdesk.com/a/tickets/3050

How to reproduce:

  • Open multiple APF reports simultaneously
  • You will face the issues as attached in the screenshot


Analysis:

  • The DB load monitoring graph for superset-db shows that load has frequently reached the maximum limit
  • We might therefore need to upgrade the instance from micro to small

Why high priority:

APF users rely on these reports on a day-to-day basis, so this issue affects their daily work.

Upgrade Prod Superset instance to version 4.0.1

Acceptance Criteria

  • Upgrade our current prod superset to version 4.0.1. https://github.com/apache/superset/releases/tag/4.0.1
    • The metadata is currently stored on the ec2 instance itself, so we need to take a manual backup before we start the upgrade.
  • Also, set up automatic storage backups for the prod superset ec2 instance for recovery purposes
  • Configure the system to use required features
    • TAGGING_SYSTEM
    • Drill-By
    • Full CSV export
      Discuss with implementation team and update this section.
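
Superset reads feature toggles from a `FEATURE_FLAGS` dictionary in `superset_config.py`, so the features listed above could be enabled roughly as follows. This is a minimal sketch: the flag names `DRILL_BY` and `ALLOW_FULL_CSV_EXPORT` are assumptions that should be checked against the `config.py` shipped with the Superset release being deployed.

```python
# superset_config.py (sketch): enable the features requested in this issue.
# Flag names are assumptions; verify them against the config.py of the
# Superset version that is actually installed.
FEATURE_FLAGS = {
    "TAGGING_SYSTEM": True,         # tagging of dashboards and charts
    "DRILL_BY": True,               # Drill-By on charts
    "ALLOW_FULL_CSV_EXPORT": True,  # full CSV export
}
```

Superset needs a restart after this file changes for the flags to take effect.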

Fixing issues with existing reports after the upgrade

  1. Analyse each report.
  2. If a report is broken, categorise it by type of issue.
  3. Document recovery steps for each type of issue.
  4. Apply the fix to all broken reports.

Validation

  • No existing report should be broken. Where large-scale modifications are required, fix the reports via SQL commands or similar

Tech details

  • Create a replica of prod superset ec2 instance
  • Upgrade to latest 4.1 version of superset
  • No Containers switch
  • No DB upgrade
  • Configure the system to use all new features
    • TAGGING_SYSTEM
    • Drill-By
    • Full CSV export
  • Compare the old and new systems and hand the new one over to the implementation team for UAT
  • After successful UAT, perform the same upgrade on Prod Superset
  • Bring down the temp infra which is no longer required (Routes and EC2 instances)

Reference doc

https://preset.io/blog/apache-superset-4-0-release-notes/
https://preset.io/blog/superset-3-0-release-notes/

Connect to RWB env prod db from metabase

https://avni.freshdesk.com/a/tickets/3762

  • Connect to RWB env prod db from metabase by applying appropriate SSH tunnelling/VPC config as needed.

Note: Metabase and RWB prod db are in different AWS accounts

Acceptance criteria

As a metabase user of rwb in reporting.avniproject.org, I should be able to query the avni gramin database.

One option to do this is to make the db public and add security groups for metabase to be able to contact it.

Add alert for performance metrics

Issue:

  • Add notification to product team via email related to performance metrics.

Acceptance criteria:

  • The below points are for RWB env
  • Add a notification for the metric "Avg response time" from newrelic. We already have this for Avni, but the notification currently appears to be linked only to vinay's mail id. Make it send notifications to the product team.
  • Add a notification when disk space on any machine in AWS reaches a threshold, say 80%
  • Also make sure a notification is set up via freshping for when the RWB server is down
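
The disk-space criterion above can be illustrated with a small check that could run on each machine. This is only a sketch under stated assumptions: the function names are hypothetical, the 80% threshold is the example value from the issue, and the delivery of the notification itself (email to the product team) is not shown.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of disk space used on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def needs_alert(percent_used: float, threshold: float = 80.0) -> bool:
    """True when usage has reached the alert threshold (80% is the issue's example)."""
    return percent_used >= threshold


if __name__ == "__main__":
    used = disk_usage_percent("/")
    if needs_alert(used):
        # Hypothetical hook point: send the email / notification from here.
        print(f"ALERT: disk usage at {used:.1f}%")
```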

Optimizing the cost for sending SNS messages by using local routes instead of shared.

Message from AWS support:

By default, when you send messages to recipients in India, Amazon SNS uses International Long Distance Operator (ILDO) connections to transmit those messages. The price for sending messages using ILDO connections is higher than the price for sending messages through local routes. The price for sending messages using local routes is shown on the Amazon SNS Worldwide SMS Pricing page here: https://aws.amazon.com/sns/sms-pricing/

Looks like you used a shared routes to send SMS messages to India.

Lastly, you should consider making use of local routes by following our documentation on registering their own senderIDs:

[+] https://docs.aws.amazon.com/sns/latest/dg/sns-register-entity-and-template.html
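
Once a sender ID, entity and template are registered, the registered IDs are passed as message attributes when publishing. The sketch below assumes a boto3 SNS client is supplied by the caller and uses the attribute names `AWS.SNS.SMS.SenderID`, `AWS.MM.SMS.EntityId` and `AWS.MM.SMS.TemplateId`; treat them as assumptions and confirm them against the AWS documentation linked above. The function names are hypothetical.

```python
def build_india_sms_attributes(sender_id: str, entity_id: str, template_id: str) -> dict:
    """Build SNS message attributes for SMS sent to India over local routes."""
    def string_attr(value: str) -> dict:
        return {"DataType": "String", "StringValue": value}

    return {
        "AWS.SNS.SMS.SenderID": string_attr(sender_id),
        "AWS.MM.SMS.EntityId": string_attr(entity_id),
        "AWS.MM.SMS.TemplateId": string_attr(template_id),
    }


def send_sms(sns_client, phone_number: str, message: str, attributes: dict):
    """Publish one SMS; `sns_client` is expected to be boto3.client("sns")."""
    return sns_client.publish(
        PhoneNumber=phone_number,
        Message=message,
        MessageAttributes=attributes,
    )
```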

Consolidate load balancers

We have 2 classic load balancers and 4 application load balancers.
A single application load balancer should be sufficient for our needs - we can route based on the request URL.

Automerge and Autobranching

Need:

We frequently face issues where we forget to merge the branches of some repos, or we merge only into master and not into all the branches belonging to later releases. Remembering to do this manually is currently complex.

Approach:

Automate this to avoid human error. Merge conflicts should also be fewer, since merges will happen on an ongoing basis.

AC:

  • Automatically create branches in all repos
  • Automatically merge the dependent branches based on declaration from a JSON file. Fail the build when merge conflicts arise.
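
A minimal sketch of the second criterion follows. It assumes a hypothetical JSON layout in which `branches` lists release branches from oldest to newest, each one being merged into the next. The file format, function names and branch names are illustrative only and not part of the issue.

```python
import json
import subprocess


def merge_plan(config_text: str) -> list[tuple[str, str]]:
    """Return (source, target) pairs: each branch is merged into the next one."""
    branches = json.loads(config_text)["branches"]
    return list(zip(branches, branches[1:]))


def run_merge(repo_dir: str, source: str, target: str) -> None:
    """Merge origin/<source> into <target>.

    `check=True` raises CalledProcessError when git exits non-zero, which
    includes merge conflicts, so the CI build fails as required.
    """
    subprocess.run(["git", "checkout", target], cwd=repo_dir, check=True)
    subprocess.run(
        ["git", "merge", "--no-edit", f"origin/{source}"],
        cwd=repo_dir,
        check=True,
    )
```

For a declaration such as `{"branches": ["9.0", "9.1", "master"]}`, `merge_plan` yields `9.0 -> 9.1` followed by `9.1 -> master`, so a fix on an older release flows forward through every later branch.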

Migrate all Reporting Databases onto Production RDS

Reporting databases are spread across different RDS instances.

  • Prod Jasper is using Staging RDS jasperserver DB
  • Prod Metabase is using Reporting RDS
  • Prod Superset is using Prod-superset EC2 instance local filesystem based PSQL DB, soon to be migrated to Prod RDS superset DB

We would like to consolidate all the Reporting DBs into one RDS instance, the Prod RDS, to optimize our resource utilization and simplify network access.

Modify CircleCI deploys to be done using AWS Roles and AWS OIDC Context

Acceptance criteria

Modify CircleCI deploys to be done using AWS Roles and AWS OIDC Context. We do not want to use the openchs-infra.pem file for CircleCI config deploys anymore.
Refer to the way we have done this for RWB Staging and Prod environments deploy in avni-server.
https://github.com/avniproject/avni-server/blob/master/.circleci/config.yml

Changes have to be done for following repo deployments

  • avni-server
  • avni-webapp
  • rules-server
  • avni-media
  • integration-service
  • etl
  • integration-admin-app

Excerpts from ciconfig file with changes done for RWB staging deploy

Sample role config:

  RWB_STAGING_deploy:
    docker:
      - image: cimg/deploy:2023.09-node
    working_directory: ~/
    steps:
      - aws-cli/setup:
          role_arn: "arn:aws:iam::730335671779:role/avni_circleci_instance_connect"
          region: "ap-south-1"
      - setup_server_access:
          instance-id: "i-00b50ac6e8413fdca"
          availability-zone: "ap-south-1b"
      - deploy_ansible:
          env: "rwb-staging"

Sample Context config:

      - RWB_STAGING_deploy:
          context:
            - RWB_AWS_OIDC
            - non-prod-deploy
          requires:
            - RWB_STAGING_approve

Migrate Superset persistence from SQLite to Postgres

Migrate Superset persistence from SQLite to a Postgres DB on Prod RDS.

Options to migrate:

  1. Set up an empty psql db for version 4.0.1 and copy the data over in CSV format. This seems the most feasible option. Refer https://medium.com/@aaronbannin/migrating-superset-to-postgres-63d2c96c5102
  2. Convert the SQLite file to psql. This is tedious, and there is no guarantee that the resulting superset DB would be usable by the superset app correctly at all times. Refer https://stackoverflow.com/questions/4581727/how-to-convert-sqlite-sql-dump-file-to-postgresql
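
The export half of option 1 could look roughly like the sketch below, which dumps every table of a SQLite file to one CSV per table using only the Python standard library. The function name and paths are hypothetical, and loading the CSVs into the empty Postgres schema is not shown.

```python
import csv
import sqlite3
from pathlib import Path


def export_tables_to_csv(sqlite_path: str, out_dir: str) -> list[str]:
    """Write one CSV (with a header row) per table and return the table names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(sqlite_path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        for table in tables:
            cursor = conn.execute(f'SELECT * FROM "{table}"')
            with open(out / f"{table}.csv", "w", newline="") as fh:
                writer = csv.writer(fh)
                writer.writerow([col[0] for col in cursor.description])
                writer.writerows(cursor)
        return tables
    finally:
        conn.close()
```

Tables with foreign keys would need to be loaded into Postgres in dependency order, and sequences reset afterwards; both steps are outside this sketch.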

Setup reserved instances for EC2 and RDS

We are still paying on-demand prices for AWS resources even though we have always-on systems.

Reserved instances give 20-30% savings. There are options with zero upfront payment for 1 year which are convertible to other resource classes and can give us quick, easy savings. Resources of the same resource class type can share any reservations we set up, i.e. if we set up a t3.medium RI, it will get applied to any t3 resource we have. It isn't necessarily tied to a t3.medium resource.
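
The sharing of a reservation across sizes within a family can be reasoned about with AWS's size normalization factors (small = 1, medium = 2, large = 4, and so on). The sketch below assumes those published factors and that size flexibility applies to the reservation in question, which AWS restricts to certain reservation types; the function name is hypothetical.

```python
# Normalization factors per instance size, as published by AWS for
# size-flexible Reserved Instances (assumed here, verify before relying on it).
NORMALIZATION_FACTOR = {
    "nano": 0.25,
    "micro": 0.5,
    "small": 1,
    "medium": 2,
    "large": 4,
    "xlarge": 8,
    "2xlarge": 16,
}


def covered_fraction(reserved_sizes: list[str], running_sizes: list[str]) -> float:
    """Fraction of running capacity (same family) covered by the reservations."""
    reserved = sum(NORMALIZATION_FACTOR[size] for size in reserved_sizes)
    running = sum(NORMALIZATION_FACTOR[size] for size in running_sizes)
    if running == 0:
        return 1.0
    return min(1.0, reserved / running)
```

Under these assumptions, one t3.medium reservation fully covers two t3.small instances, or half of a t3.large.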

Use containerized applications in Avni

Avni load-balancing, networking and system update considerations

  • Growing number of application modules which we have in Avni (avni-server, rules-server, avni-webapp, avni-media-server, avni-media-client, etl, avni-integration-service and avni-int-admin-app, keycloak, Minio)
  • The need for these modules to easily interact with each other over network
  • The need for easy deployment and maintenance of these app modules
  • Optimize resource utilization
  • Load-balancing needs that might arise in future
  • Reporting software upgrades
  • SSL certificates upgrades and using a single chained one

Proposed solution:

In order to handle the above considerations, we should seriously consider using Docker containers and Kubernetes clusters, along with some rework of the Avni auth layer.

Create docker container config for avni apps - 20d

  • avni-server - 2d
  • webapp - 2d
  • rules-server - 2d
  • integration-service - 1d
  • integration-admin-app - 1d
  • etl - 2d
  • avni-media-server - 1d
  • avni-media-client - 1d
  • keycloak - 1d
  • Minio - 1d
  • JasperServer - 2d
  • Metabase - 2d
  • Superset - 2d

Create Kubernetes config for Avni - 22d

  • Create docker hub - 0
  • Load-balancers - 3d
  • network config for inter app communication - 5d
  • Reverse-proxy - 2d
  • services - 3d
  • deployments - 5d
  • HorizontalPodAutoscaler for few apps - 2d
  • SSL cert in-front of ReverseProxy - 2d

Handle logging, uptime and debugging aspects - 5d

Handle build and deploy of applications using CircleCI (or other CI tools) - 10d

Setup staging env using kubernetes - 2d

Setup Prod env using kubernetes - 2d

Switch Prod DB from avni-server standalone to kubernetes avni-server. - 2d

Total - 63 days

Optimize resource utilization (EC2, LBs, Network and RDS) for RWB

Optimize resource utilization (EC2, LBs, Network and RDS) for RWB.

Step 1:

Perform analysis on cost savings as well as resource utilization optimizations that could be done

  • Instance type changes for resource utilization optimizations
  • Reserve instances for billing discounts
  • Discarding redundant instances
  • etc.

Step 2:

Submit recommendations with information on

  • change
  • impact (positive / negative)
  • cost-savings
  • Action-plan and any follow-up activity needed

We'll review the recommendations and provide approval for specific changes

Step 3:

Implement approved changes. Perform any other follow-up activity needed, which was identified during analysis.

Migrate RDS T2 instances

RDS T2 instances will be EOL in April 2024.
reportingdb is still on t2 and needs to be changed to a new instance type.

Setup a job to reboot Jasper server periodically

Jasper server currently becomes unresponsive every few weeks. Rebooting or restarting the service usually fixes the issue. Automate this so that the downtime falls in non-business hours and no manual intervention is required.

Acceptance criteria:
Add a cron job to reboot the server daily / weekly during the early morning hours IST.
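
Cron schedules are evaluated in the server's local time zone, so an early-morning IST slot has to be converted if the server clock runs in UTC. The helper below is a sketch under that assumption (a UTC server clock) and only produces the schedule part of a daily crontab entry; the restart command itself is not specified in the issue and is left out.

```python
from datetime import datetime, timedelta, timezone

# India Standard Time is UTC+05:30 with no daylight saving.
IST = timezone(timedelta(hours=5, minutes=30))


def daily_cron_in_utc(hour_ist: int, minute_ist: int = 0) -> str:
    """Return the 'minute hour * * *' cron fields for a daily IST time, in UTC."""
    local = datetime(2024, 1, 1, hour_ist, minute_ist, tzinfo=IST)
    utc = local.astimezone(timezone.utc)
    return f"{utc.minute} {utc.hour} * * *"


if __name__ == "__main__":
    # 04:00 IST corresponds to 22:30 UTC on the previous day.
    print(daily_cron_in_utc(4))
```

A weekly schedule would additionally need the day-of-week field shifted whenever the conversion crosses midnight, as it does for 04:00 IST.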

Handle log-file config for avni-server

Handle log-file config for avni-server.
Currently we redirect all logs to server.log without logFileRotation and maxSize limit.
This has adverse impact for int-service and etl-service deploys as well.
We should let each appserver handle log for itself.

Migrate Superset to Docker based setup

Need:

Set up the latest version of superset using a Docker container based setup, for easy upgrades.

Acceptance criteria:

  • Set up container based superset version 4.1 using the prerelease rds superset DB
  • Configure it similar to prod superset
  • After the User Acceptance Test is confirmed successful, switch the dockerized superset to use prod rds and route all superset traffic to the new instance
  • Bring down all other migration infra after taking the necessary backups

Tech Steps (Blue Green Deployment approach)

  1. Take snapshot of superset ec2 instance EBS volume
  2. Setup a new EC2 instance superset Blue
    • use Docker container setup to install version 4.1 superset
  3. Configure new superset to work with Pre-release RDS superset DB
  4. Configure new superset to have all required features enabled (Tagging, Drill down, etc.)
  5. Validate that all reports are working in new superset same as the old one
    If everything is working fine and the Implementation team gives the go-ahead, then
  6. Stop both old and new Superset services
  7. Point new superset to Prod RDS superset db
  8. Switch superset Hostname from old to the new superset ec2 instance
  9. Take one last backup of old superset and bring it down
  10. Remove the new superset Route53 entry

suggestions: Ignore

  • current version
  • prod ready not in scope.
  • one-click installer/ via docker /ansible
  • upgrade current machine to 4.1 - since already tried in security env

RDS CA certificates expiring

  • The affected RDS instances are proddb02, proddb02-read and superset-db. All of them use "rds-ca-2019" as their Certificate Authority, which expires on August 22, 2024. We need to update the Certificate Authority to keep the DBs healthy and their security up to date.

  • The change required is to modify the Certificate Authority to "rds-ca-rsa2048-g1" (the default), which expires on May 20, 2061.

Suggestion:

  • Apply in following order:- superset-db, proddb02-read, proddb02
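
The suggested order can be scripted against the RDS API. The sketch below assumes a boto3 RDS client is passed in by the caller and uses its `modify_db_instance` call with `CACertificateIdentifier`; the function name is hypothetical. Whether the change restarts an instance depends on the engine and version, so `apply_immediately` defaults to `False`, which defers the change to the maintenance window.

```python
# Order taken from the suggestion above: least critical instance first.
ROTATION_ORDER = ["superset-db", "proddb02-read", "proddb02"]
NEW_CA = "rds-ca-rsa2048-g1"


def rotate_ca(rds_client, instance_ids=ROTATION_ORDER, ca_id=NEW_CA,
              apply_immediately=False):
    """Switch each instance to the new CA, in order; return the ids processed.

    `rds_client` is expected to be boto3.client("rds"). An exception from any
    call stops the loop, so later (more critical) instances are left untouched.
    """
    applied = []
    for instance_id in instance_ids:
        rds_client.modify_db_instance(
            DBInstanceIdentifier=instance_id,
            CACertificateIdentifier=ca_id,
            ApplyImmediately=apply_immediately,
        )
        applied.append(instance_id)
    return applied
```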

Services running using pm2 do not restart on server reboot

rules-server, avni-media are run using pm2. If the instance on which these are running gets restarted, these services need to be configured to be restarted too.

Impacted services:

  • Media-server (Integration instance)
  • Rules-server (Avni-server instance)

Environments:

  • Prod
  • Prerelease
  • Staging
  • RWB_prod
  • RWB_Staging

Reduce DB storage size

Context:
When we were using gp2 disks for the database, the only cost-effective way to get additional IOPS was to increase the disk size. As a result we have a 300GB database while our usage is less than 10GB. Now that we have moved to gp3, IOPS is not tied to disk size and we can lower the database size.

The DB backups also contribute to the cost impact of this storage size.

RDS does not offer an easy way (from the console) to do this and this will require a custom solution. Refer for methods

Fix renewal of certs for Minio

Currently, the letsEncrypt certs are in a different folder than the place where we have the Minio certs folder "/etc/minio/ssl".

We need to add a deploy-hook to certbot to copy the renewed certs to "/etc/minio/ssl".
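
A deploy hook for this could be a small script like the sketch below. It assumes certbot exposes the renewed certificate directory to the hook through the `RENEWED_LINEAGE` environment variable, and that Minio expects the files to be named `public.crt` and `private.key`; both should be confirmed for the installed versions. File ownership, permissions for the Minio service user and any Minio restart are deployment-specific and not handled here.

```python
import os
import shutil
from pathlib import Path

MINIO_CERT_DIR = "/etc/minio/ssl"  # target folder named in this issue


def copy_renewed_certs(lineage_dir: str, target_dir: str = MINIO_CERT_DIR) -> None:
    """Copy the renewed Let's Encrypt files into the Minio certs folder."""
    src = Path(lineage_dir)
    dst = Path(target_dir)
    dst.mkdir(parents=True, exist_ok=True)
    # File names on the Minio side are assumed, see the note above.
    shutil.copyfile(src / "fullchain.pem", dst / "public.crt")
    shutil.copyfile(src / "privkey.pem", dst / "private.key")


if __name__ == "__main__":
    # Set by certbot when it runs the hook (assumption); do nothing otherwise.
    lineage = os.environ.get("RENEWED_LINEAGE")
    if lineage:
        copy_renewed_certs(lineage)
```

The script would be registered with certbot's `--deploy-hook` option so that it runs only after a successful renewal.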

Implement new access and security policy for AWS resources

  1. We might need a separate AWS account that runs production. This means eventually we will be running at least 3 different AWS accounts - 5d
  2. Come up with list of target entities - 1d
  3. Come up with privileges for those target entities - 1d
  4. Come up with env based grouping of target entities and privileges - 1d
  5. Come up with Base set of roles (Target entity + Privilege + environment => Role), includes - 2d
    -- AWS console access -
    -- SSH access to servers
    -- AWS service access (Ec2, RDS, Cognito, S3, etc)
  6. UserGroups will be assigned one or more Roles - 1d
  7. Users will be a part of one or more userGroups - 0d
  8. We’ll then do staggered switch from old way of access to the new approach - 0d
  9. Deprecate the old SSH keys and AWS credentials which grant role/user-group agnostic access - 2d

Total: 8d * 2(Ramp-up, Misc tasks, bugs/issues) = 20d => 4 weeks - High Level estimate

Avni read db to medium and change in password for prod main

  • Running queries on the read db is becoming very slow, and this must also be impacting customers' queries from reports
  • Because of this I have been running queries on prod master, which is not a good idea
  • The prod main password can be made confidential so that only a few people have it. This would force all database updates to go through a review process

upload-user creation and documentation

Need:

To run rules on the data uploaded via CSV. For RWB, work orders are uploaded via CSV. Though there are currently no rules, it is better to get this set up.

Acceptance criteria:

  • Make sure upload-user is created in RWB cognito pool
  • Also make sure to configure OPENCHS_UPLOAD_USER_USER_NAME and OPENCHS_UPLOAD_USER_PASSWORD in the machine where rules_server is setup
  • add documentation for it in setting up AWS environment readme.
  • If it does not take much time, do this for LFE as well. Note: rules-server is not set up for LFE as of now.
