djay / covidthailand


Thailand Covid testing and case data gathered and combined from various sources for others to download or view

Python 99.99% Ruby 0.01% SCSS 0.01% Shell 0.01% Dockerfile 0.01%
daily-situation-reports ccsa-daily-briefing confirmed-cases proactive-tests health-district covid19-data covid-api thailand hacktoberfest hacktoberfest2021

covidthailand's People

Contributors

andamanopal, chrisadas, djay, flyingvince, kinshukdua, modiholodri, pmdscully, reduxionist, wasdee


covidthailand's Issues

testing webdav is down (or gone), breaking the run

Probably the best way is to change the code in

def dav_files(url, username=None, password=None,
so that if there is an error connecting, it falls back to listing all the files in the local dir instead (see the sketch after the traceback). Assume the dir is only used for caching, but that should be good enough.

web_files has a similar check=True mode that will use cached content instead if available.

  tests_reports = get_test_reports()
  File "/home/runner/work/covidthailand/covidthailand/covid_data.py", line 2286, in get_test_reports
    for file, dl in test_dav_files(ext=".pptx"):
  File "/home/runner/work/covidthailand/covidthailand/utils_scraping.py", line 362, in dav_files
    client.list(get_info=True),
  File "/opt/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/site-packages/webdav3/client.py", line 67, in _wrapper
    res = fn(self, *args, **kw)
  File "/opt/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/site-packages/webdav3/client.py", line 264, in list
    response = self.execute_request(action='list', path=directory_urn.quote())
  File "/opt/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/site-packages/webdav3/client.py", line 230, in execute_request
    raise ResponseErrorCode(url=self.get_url(path), code=response.status_code, message=response.content)
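
A minimal sketch of that fallback, assuming the cache dir name and the shape of the client.list(get_info=True) result (this is not the current dav_files signature, just an illustration):

import os
from webdav3.client import Client

def dav_files(url, username=None, password=None, directory="inputs/testing_moph"):
    options = dict(webdav_hostname=url, webdav_login=username, webdav_password=password)
    try:
        client = Client(options)
        return client.list(get_info=True)
    except Exception:
        # webdav is down or gone: fall back to whatever is already cached locally.
        # Only 'path' is filled in here; callers needing more fields would need adjusting.
        return [dict(path=os.path.join(directory, f)) for f in sorted(os.listdir(directory))]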

switch website to netlify/proper domain

Goal would be to have a better name, a shareable URL and a single site to go to

  • netlify builds working
  • buy domain name (not covid specific) and put on netlify
  • switch image urls to netlify instead of wiki and turn off wiki image uploading
  • turn off gh pages and switch all url references and put in some kind of redirect on gh pages to the new site?
  • move site to index.md and move downloads and contributions to the readme so that's all you see on GitHub. The main site points to GitHub for downloads and contributions
  • move all wiki content to discussions and have FAQ section on website pointing to discussions
  • (unsure) split sections into pages and use menus

Get better data on death ages.

I'm now collecting cumulative CFR for 3 age ranges (since 1st April) from the situation reports.
[screenshot: cumulative CFR by age range, 2021-07-05]

However, these cumulative CFR values can sometimes decrease (see "W3 CFR 15-39" on the 29th). CFR can go down if cases increase, but the cumulative cases I am using don't seem to rise at the right times.

So turning this into daily deaths by age (CFR * cases = deaths) is turning out to be a bit tricky, as so far I end up with negative deaths on certain days.
Either

  • the dates of the cases used in their CFR calculation are not the same as those found in the covid-19 dataset
  • the lack of precision in the CFR number throws things off
  • the cases with missing ages in the covid-19 dataset throw things off (maybe the share of them gets too large on certain days?)

Need to work out how to correct for these to get reasonable numbers that align with the min, max and median of reported deaths.
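
A rough sketch of the calculation being attempted (the column names, and the age ranges other than 15-39, are assumptions rather than the exact dataframe in covid_data.py):

def deaths_by_age(df, ages=("15-39", "40-59", "60-")):
    for age in ages:
        # cumulative deaths implied by the reported cumulative CFR
        cum_deaths = df[f"W3 CFR {age}"] / 100 * df[f"Cases Cum {age}"]
        # daily deaths = day-on-day difference; this goes negative whenever the
        # reported CFR drops or the case series doesn't rise at the right time
        df[f"Deaths {age}"] = cum_deaths.diff()
    return df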

💡 Feature Request: Allow dataset collection errors to occur without fatal errors. I.e. change from Fail Early to Fail Partially.

Problem: Data sources change sometimes and currently exceptions are handled by the main python process, such that any exception will cause the run to exit with a fatal error. (Fail early)

The feature request is:

  • Decouple each dataset collection procedure, to report (and permit) failures in one dataset source without halting collection from the other sources. (Fail partial)

Proposed solution:

  1. Decouple Data Collection Functions (or Chains of Functions) (e.g. per output dataset/df or per dataset source/url):
    • i.e. ensure one chain of functions handles one thing/source/dataset.
    • i.e. Break apart main call graph such that each dataset collection is independent from other dataset collections.
    • Specifically, such that one dataset collection can fail independently, but not hold up the other dataset collections.
  2. Independent Execution, either by:
    • (i) add try-excepts into __main__, around each function call (i.e. the entrance into a call graph chain); see the sketch after the Expected Outcomes list.
    • (ii) Or, create separate processes for each chain of functions and let main continue to handle exceptions.

Expected Outcomes:

  • Changes to dataset sources that cause fatal errors no longer hold up collection from all other dataset sources.
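
A minimal sketch of option (i), wrapping each independent collection chain in __main__ so one failing source doesn't stop the others (apart from get_test_reports, the function names here are illustrative, not the actual call graph):

import logging

def collect_all():
    results = {}
    collectors = [
        ("tests", get_test_reports),                    # exists today
        ("briefings", get_briefings),                   # illustrative name
        ("situation_reports", get_situation_reports),   # illustrative name
    ]
    for name, collect in collectors:
        try:
            results[name] = collect()
        except Exception:
            # Fail partially: record the failure and keep collecting the other sources
            logging.exception("Collection of %s failed, continuing with other sources", name)
    return results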

Fix Vaccination plots

  • daily needs same order groups
  • daily needs to handle missing data from groups better
  • % progress missing data and wrong colours.
  • progress missing the new groups
  • gaps in province data #67
  • goes to zero sometimes. dashboard data? got negative daily also
  • unknown shows up as 3rd dose in daily plot. handle unknown properly. #68

fix gaps in vaccinations by province


Could be one of several problems

  • bad parsing on vaccination reports tables
  • missing data on vaccination report tables
  • previous mistakes in dashboard code have stored 0 or other bad data (this data is not refreshed each time)
  • is dashboard data even used for this graph?
  • if there is no data available, is it better to interpolate it? (see the sketch below)
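
If interpolation turns out to be the answer, a sketch of filling only the internal gaps per province (the exact column and index names are assumptions):

# only fills NaNs surrounded by real values, so leading/trailing gaps stay visible
df["Vac Given 1 Cum"] = df.groupby("Province")["Vac Given 1 Cum"].transform(
    lambda s: s.interpolate(limit_area="inside")
)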

add predicted vaccination rate to progress by groups plots

perhaps to the end of the year?

Should be able to get the rate for the last 2 weeks and plot a dotted line going forward. Maybe even a line to show where it hits 70% and 80%?

An enhanced version might modify the 2nd dose estimate by factoring in the time since first dose and the vaccines used?
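
A sketch of the simple projection described above, assuming a cumulative-dose series indexed by date (the names and horizon are illustrative):

import numpy as np
import pandas as pd

def project(cum, days_ahead=120):
    # average daily rate over the last 2 weeks of data
    rate = (cum.iloc[-1] - cum.iloc[-15]) / 14
    future = pd.date_range(cum.index[-1], periods=days_ahead + 1, freq="D")[1:]
    return pd.Series(cum.iloc[-1] + rate * np.arange(1, days_ahead + 1), index=future)

# plotted as a dotted line continuing the real series, e.g.
# project(df["Vac Given 1 Cum"]).plot(style=":")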

plot home isolation and hospital numbers

dashboard data has extra breakdown on field hospitals

                D_Hospitel="Hospitalized Field Hospitel",
                D_HICI="Hospitalized Field HICI",
                D_HFieldOth="Hospitalized Field Other",

Scrape in parallel

None of the scrapers rely on each other and they can be run in any order, so it should be easy to put them in a worker queue and use multiple processes to get through them faster on Actions.
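
A minimal sketch using a process pool (the scraper registry passed in is illustrative):

from concurrent.futures import ProcessPoolExecutor, as_completed

def scrape_all(scrapers):
    # scrapers: dict mapping a name to a zero-argument scraping function
    results = {}
    with ProcessPoolExecutor() as pool:
        futures = {pool.submit(fn): name for name, fn in scrapers.items()}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results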

Fix missing allocations by using % given.

According to the raw data published by MOPH, vaccine allocation data is separated by manufacturer (Sinovac and AstraZeneca). However, the scraped data shows Vac Allocated AstraZeneca 1 | 2 and Vac Allocated Sinovac 1 | 2. What is the difference between the two?

Also, vaccine allocation data for every province is missing after 9th May 2021.

Thanks.

EDIT

Fixes

  • Improve documentation in the readme about dose 1|2. Maybe better names?
  • Add in an Allocated 1|2 value calculated from the given % (or using allocated where available?); see the sketch below.
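
A sketch of the back-calculation, assuming the reported % is doses given as a share of doses allocated (the column names follow the existing export but are not confirmed):

# will be inf/NaN wherever the reported % is zero or missing
df["Vac Allocated 1"] = df["Vac Given 1 Cum"] / (df["Vac Given 1 %"] / 100)
df["Vac Allocated 2"] = df["Vac Given 2 Cum"] / (df["Vac Given 2 %"] / 100)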

Vac Given by brands is gone?

Hi Dylan, I noticed that the "Vac Given" by-brand columns are gone from the vac_timeline.csv file. Were they moved to another file or is this a bug? If it is a bug, I could try and help fix it.

Nat

feature: plot trending clusters

  • briefing has clusters listed. Not sure if it's all of them or some.
  • Might be interesting to show the top 5 clusters that are growing
  • Unique id would be date + cluster name? (maybe province/s?)
  • Cluster name would probably have to be in Thai (but at least the province would be translated)

[screenshot: cluster listing from a briefing, 2021-07-16]

Add right side axis labels

The most interesting data are on the right side of the chart. Having labels only on the left side makes it somewhat hard to read.

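A minimal fix, assuming the plotting code has access to the matplotlib Axes for each chart:

# mirror the y tick labels onto the right-hand side as well
ax.tick_params(axis="y", which="both", labelright=True, right=True)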

Dashboard province data incorrect sometimes

Seems to happen when it needs to get more than one date in one session.
For example, 2021-09-01 is actually 85 not 35, but 156 is correct.
https://github.com/djay/covidthailand/runs/3492265169?check_suite_focus=true

2021-09-02 00:00:00 MOPH Dashboard Retry Missing data at (datetime.datetime(2021, 9, 2, 0, 0), 'Buriram') for ['Vac Given 1 Cum', 'Vac Given 2 Cum', 'Vac Given 3 Cum']. Retry
2021-09-02 MOPH Dashboard 156.0 0.0 0.0 0.0 156.0 0.0 NaN NaN NaN NaN Buriram
...
2021-09-01 00:00:00 MOPH Dashboard Retry Missing data at (datetime.datetime(2021, 9, 1, 0, 0), 'Buriram') for ['Vac Given 1 Cum', 'Vac Given 2 Cum', 'Vac Given 3 Cum']. Retry
2021-09-01 MOPH Dashboard 34.0 0.0 0.0 0.0 34.0 1.0 NaN NaN NaN NaN Buriram

The existing code tries to reset the connection if there is a failure; somehow it must be going wrong.
https://github.com/djay/covidthailand/blob/main/utils_scraping.py#L694

Solution Proposal for Vac Report Table

The order of columns remains the same. We can just keep the rows that contain numbers and add the column names afterwards.

import os

import camelot
import pandas as pd
import requests


def parse_raw(url):
    # Download the daily vaccination report and cache it locally
    os.makedirs("tmp", exist_ok=True)
    response = requests.get(url)
    with open("tmp/daily_report.pdf", "wb") as file:
        file.write(response.content)
    # The vaccination tables are on pages 2 and 3
    tables = camelot.read_pdf("tmp/daily_report.pdf", pages="2,3", split_text=True)
    raw_table = pd.DataFrame()
    for i in range(2):
        df = tables[i].df
        # Keep only data rows, i.e. rows whose second column is a number
        df = df[df[1].str.isdigit()]
        df.drop([2], axis=1, inplace=True)
        raw_table = pd.concat([raw_table, df], ignore_index=True)
    # Cells can contain several newline-separated values; split them out
    table_dict = raw_table.transpose().to_dict()
    rows = []
    for row_num in table_dict:
        cleaned_row = []
        for key, value in table_dict[row_num].items():
            for col in value.replace(" ", "").split("\n"):
                if col:
                    cleaned_row.append(col)
        rows.append(cleaned_row)
    return pd.DataFrame(rows)


df = parse_raw("https://ddc.moph.go.th/vaccine-covid19/getFiles/9/1628485849393.pdf")
test = df.iloc[:, 0:12]
test.columns = ["Health Area", "Population", "Vac Allocated AstraZeneca", "Vac Allocated Sinovac",
                "Vac Allocated Pfizer", "Vac Allocated Total", "Vac Given 1 Cum", "Vac Given 1 %",
                "Vac Given 2 Cum", "Vac Given 2 %", "Vac Given 3 Cum", "Vac Given 3 %"]
print(test)

incorrect 3rd dose on daily vaccines when vaccine report delayed

https://github.com/djay/covidthailand/wiki/vac_groups_daily_30d.png

The reason is we have data for 1st, 2nd, 3rd dose earlier than the breakdowns per group.
Dashboard and briefings give us the dose totals. Vac reports give us the groups.

So need to show this data properly. Probably with a 1st dose / 2nd dose category which is only shown on days where the other data is missing, similar to the unknown feature of the plot_area function, but in this case we have 3 types of data we can show when groups aren't available (see the sketch below).
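
A sketch of the masking, assuming the per-group columns share a "Vac Group" prefix (the column names are illustrative, not the real dataframe):

# show the plain dose totals only on days where no per-group breakdown exists yet
group_cols = [c for c in df.columns if c.startswith("Vac Group")]
no_groups = df[group_cols].isna().all(axis=1)
for dose in (1, 2, 3):
    df[f"Vac Given {dose} (no group data)"] = df[f"Vac Given {dose}"].where(no_groups)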

Improved Infections estimate

The current infections estimate uses the ages of the population to infer the chance of a given person dying of covid. However, if this were accurate then the predicted median age of death would be ~80 instead of 65-70 (a sketch of the underlying calculation follows the list of ideas below).

Ideas

  • get better estimate of ages of those that died
    • take ages from cases - 11 days (or current avg days till death from confirmation).
    • get more data on ages of deaths - #34
    • some combination?
    • these should raise the estimate, since both should give lower ages than the general population?
  • factor in comorbidities.
    • Said to be more common in Thailand than the global average. True?
    • if true, how to modify the IFR to factor this in?
    • if true will lower the infections estimate
  • Use proper distribution for IFR age ranges
  • estimates of untested deaths? (estimated as 1-1.2x)
    • Use excess deaths corrected for expected changes in suicide and road deaths?
    • Including untested deaths estimate would raise the infections estimate
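
For reference, the basic relationship behind the estimate as a sketch; the IFR values, the death-age split and the 11 day lag below are illustrative assumptions, not the repo's actual numbers:

# assumed IFR per age band and assumed share of deaths falling in each band
ifr_by_age = {"0-29": 0.0001, "30-59": 0.002, "60+": 0.04}
death_share_by_age = {"0-29": 0.02, "30-59": 0.28, "60+": 0.70}

# infections = sum over ages of (deaths in that band / IFR for that band),
# which collapses to deaths divided by an "effective" IFR
effective_ifr = 1 / sum(share / ifr_by_age[age] for age, share in death_share_by_age.items())

# shift deaths back by an assumed ~11 days from confirmation to death
infections_est = df["Deaths"].shift(-11) / effective_ifr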

Delete viz-updates branch

@djay The viz-updates branch was originally intended for visualisation updates, but these have been covered through other branches already. The viz-updates branch also contained miscellaneous updates that have nothing to do with how data is being presented (for example code cleanup, which I plan to implement at a later stage).

I suggest deleting the viz-updates branch as I am not planning on merging it or working on it further.

plot severe cases per age group

This data is currently collected from the dashboard scraping but not displayed anywhere.
Could help show the effect of vaccinations, as it should show severe cases for the elderly trending down and severe cases for younger people trending up in a % plot.

add back in the allocated vaccines

Previously this was on the cumulative vaccine plot and also on vac_daily (as a runway amount, i.e. how many per day to run out in 1 week).

hover to get detail on numbers (interactive js plot)

The idea would be to have the ability to view details for every day of every line, similar to OWID. Could also show actual and average values this way.


Consider the time to reimplement all the customisations in the plot function vs how nice the interactivity or plotting is.

Proposed solution

  • SVG with embedded JS
    • prototype in #217 based on the Tooltip example, but instead set a single mouseover on the main plot rectangle, then work out the date and update the tooltip based on it
    • Pro: plots look exactly the same with no extra effort. Only have to get this custom JS to work.
    • Con: lose ability to copy and paste plots (but might be able to add in a button to do this?)
    • Con: we can't have the JS work and copy-and-paste of the image at the same time, but with some JS we could maybe add a right click or a button to do the same thing later if we want.

Alternative solutions

  • MPLD3

  • Redo plot_area with plotly

    • or Bokeh or others
    • Would have to start again with laying everything out nicely but tools for doing that might be nicer
    • Could still save as png on server so users can have export?
    • Seems like pandas plot has direct support for plotly, but not sure we can use that (see the sketch after this list)
    • e.g. js based like plotly
  • mplh5canvas?
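
A quick way to try the pandas/plotly route, assuming plotly is installed (pandas >= 1.0 supports switching the plotting backend); the output path here is illustrative:

import pandas as pd

pd.options.plotting.backend = "plotly"
fig = df.plot(title="Cases by source")   # returns a plotly Figure with hover tooltips built in
fig.write_html("outputs/cases.html", include_plotlyjs="cdn")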

Feature Request: Increase legend and title font sizes via rc.params

Problem:

Desktop viewers on https://djay.github.io/covidthailand/ can zoom in or click to expand image sizes.
Desktop viewers on https://github.com/djay/covidthailand/ cannot zoom in, as GitHub CSS maintains the image (/column) width.

This can make reading legends and title headings more difficult.

Recommended Solution:

The matplotlib rc.params default settings are set around this line:

plt.rcParams.update({'font.size': 16})

Add / update rc.params for these items until those plots can easily be read at a quick glance (rc-params-list); see the example after the list:

  • legend.fontsize
  • axes.titlesize
  • figure.titlesize
  • figure.titleweight
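
For example (the sizes here are guesses to be tuned, not final values):

import matplotlib.pyplot as plt

plt.rcParams.update({
    'legend.fontsize': 18,
    'axes.titlesize': 22,
    'figure.titlesize': 24,
    'figure.titleweight': 'bold',
})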

Then:

  • Execute plot generating code.
  • Check plot headings / legends can be read more easily, yet still maintain good aesthetic appearance.

Expected Outcomes:

  • No warnings or error while running plot generating code.
  • Plot Headings and Plot Legends are easily read "at a glance", while maintaining a good aesthetic appearance.

find source for "Adverse Reactions After Vaccination"

used to appear in each vaccination report


A total of 18 cases of severe adverse events following vaccination have been confirmed by the expert panel*: 15 with a severe allergic reaction (anaphylaxis) and 3 with numbness (polyneuropathy). All recovered after treatment, and no deaths have been found to be caused by the vaccine.

Period | Number of severe adverse events confirmed by the expert panel*
-- | --
Cumulative since 28 February 2021 | 18
Severe allergic reaction (anaphylaxis) | 15
Numbness (polyneuropathy) | 3

no data source for vaccinations details

the following is no longer in the report

  • allocations
  • per province risk groups or totals

All we have now is the total + % for each shot for the top 10 provinces vaccinated.

Options

  • find another source
  • show by province but with a "rest of Thailand" section; remove other plots/exports

linked list of sources (csv and html)

To make it easy for people to get access to the source data if they want.

Listed by date.

  • Briefings
  • situation reports
  • testing
  • vaccination
  • by area/province

Single python file too big

Needs to be broken up but into how many files?

The code roughly breaks down like this

  • utils - one file or many?
    • scraping - getting and converting to text. web_links, web_files, dav_files, parse_file
    • converting pptx charts
    • twitter scraping utils
    • some generic abstract utils like split and pairwise (mainly used to parse tables from text)
    • some pandas utils like fuzzy_join, trendline, topprov, export, import, human_format
    • date scraping utils (mainly thai dates)
    • province matching
      • move misspellings to csv file
      • area_crosstab
  • thai covid scraping
    • situation reports
      • split thai scraping into smaller functions
    • RB tweets
      • split get_cases_by_prov_tweets
    • briefing reports
      • split briefing_deaths (2-3 different types of tables scraped)
    • testing data
    • vaccination reports
    • covid19daily
      • risk matching. put into CSV
    • ability to run scrape without plot, and pick what to scrape from command line/env?
  • plotting/analysis
    • could get split up by data used, e.g. vaccines vs cases
    • run plot without scrape (similar to use_cache_data now)

Not sure yet if there is one utils file or many?

  • scraping.py
  • thaidates.py
  • thaiprov.py
  • pdutils.py
  • collect.py
  • plot.py

Dataset request

Since the official API for cases, deaths and hospitalization data (https://covid19.th-stat.com/en/api) is unreliable: it's not regularly maintained and can't be accessed from outside of Thailand (yep, what the hell), which prevents server-side requests; can you please add total deaths, cases and hospitalization numbers to the Daily Situation Reports?

cannot run code locally because province names rely on a data source that is no longer available

The cases data from https://covid19.th-stat.com/api/open/cases, which is used to build the provinces data, is currently down and causes an error while scraping.

_, cases = next(web_files("https://covid19.th-stat.com/api/open/cases", dir="json", check=False))

EDIT

proposed fix

  1. export all the current mapping to csv and commit it. Should be just the Province->ProvinceEn mapping.
  2. replace the current sources with a single import of a local province_mapping.csv (see the sketch below)
  3. remove old code (or move to unused function)
  4. should health districts also be kept locally?
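
A sketch of steps 1-2 (the csv name is an assumption, and `provinces` stands for the dataframe currently built from the API):

import pandas as pd

# one-off: dump the mapping built from the old API and commit the csv
provinces[["Province", "ProvinceEn"]].drop_duplicates().to_csv("province_mapping.csv", index=False)

# future runs: load the committed mapping instead of hitting the dead endpoint
province_mapping = pd.read_csv("province_mapping.csv")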

Question: would it be possible to get data on vaccine administered by brands?

From these MOPH daily slides, on the 2nd page: https://ddc.moph.go.th/uploads/ckeditor2//files/Slide%202021-07-30.pdf

Example data

{
    "date":"2021-07-04",
    "Sinovac_daily": 20170,
    "AstraZeneca_daily": 61284,
    "Sinopharm_daily": 17151,
    "AstraZeneca_total": 4061982,
    "Sinopharm_total": 95076,
    "Sinovac_total": 6513839
  },

The Researcher has already collected this type of data, but it's incomplete, with some dates missing some data.
https://github.com/porames/the-researcher-covid-bot/blob/master/components/gis/data/vaccine-manufacturer-timeseries.json
