djay / covidthailand


Thailand Covid testing and case data gathered and combined from various sources for others to download or view

Python 99.99% Ruby 0.01% SCSS 0.01% Shell 0.01% Dockerfile 0.01%
daily-situation-reports ccsa-daily-briefing confirmed-cases proactive-tests health-district covid19-data covid-api thailand hacktoberfest hacktoberfest2021

covidthailand's People

Contributors

andamanopal, chrisadas, djay, flyingvince, kinshukdua, modiholodri, pmdscully, reduxionist, wasdee


covidthailand's Issues

testing webdav is down (or gone), breaking the run

Probably the best way is to change the code in

def dav_files(url, username=None, password=None,
so that if there is an error connecting, it falls back to listing all the files in the local dir instead (see the sketch after the traceback). Assume the dir is only used for caching, but that should be good enough.

web_files has a similar check=True mode that will use cached content instead if available.

  tests_reports = get_test_reports()
  File "/home/runner/work/covidthailand/covidthailand/covid_data.py", line 2286, in get_test_reports
    for file, dl in test_dav_files(ext=".pptx"):
  File "/home/runner/work/covidthailand/covidthailand/utils_scraping.py", line 362, in dav_files
    client.list(get_info=True),
  File "/opt/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/site-packages/webdav3/client.py", line 67, in _wrapper
    res = fn(self, *args, **kw)
  File "/opt/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/site-packages/webdav3/client.py", line 264, in list
    response = self.execute_request(action='list', path=directory_urn.quote())
  File "/opt/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/site-packages/webdav3/client.py", line 230, in execute_request
    raise ResponseErrorCode(url=self.get_url(path), code=response.status_code, message=response.content)
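
A minimal sketch of that fallback, assuming the cache dir name and the shape of the client.list(get_info=True) result (this is not the current dav_files signature, just an illustration):

import os
from webdav3.client import Client

def dav_files(url, username=None, password=None, directory="inputs/testing_moph"):
    options = dict(webdav_hostname=url, webdav_login=username, webdav_password=password)
    try:
        client = Client(options)
        return client.list(get_info=True)
    except Exception:
        # webdav is down or gone: fall back to whatever is already cached locally.
        # Only 'path' is filled in here; callers needing more fields would need adjusting.
        return [dict(path=os.path.join(directory, f)) for f in sorted(os.listdir(directory))]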

switch website to netlify/proper domain

Goal would be to have a better name, a shareable URL and a single site to go to

  • netlify builds working
  • buy domain name (not covid specific) and put on netlify
  • switch image urls to netlify instead of wiki and turn off wiki image uploading
  • turn off gh pages and switch all url references and put in some kind of redirect on gh pages to the new site?
  • move site to index.md and move downloads and contributions to the readme so that's all you see on GitHub. The main site points to GitHub for downloads and contributions
  • move all wiki content to discussions and have FAQ section on website pointing to discussions
  • (unsure) split sections into pages and use menus

Get better data on death ages.

I'm now collecting cumulative CFR for 3 age ranges (since 1st April) from the situation reports.
[screenshot: cumulative CFR by age range, 2021-07-05]

However, these cumulative CFR values can sometimes decrease (see "W3 CFR 15-39" on the 29th). CFR can go down if cases increase, but the cumulative cases I am using don't seem to rise at the right times.

So turning this into daily deaths by age (CFR * cases = deaths) is turning out to be a bit tricky, as so far I end up with negative deaths on certain days.
Either

  • the dates of the cases used in their CFR calculation are not the same as those found in the covid-19 dataset
  • the lack of precision in the CFR number throws things off
  • the cases with missing ages in the covid-19 dataset throw things off (maybe the share of them gets too large on certain days?)

Need to work out how to correct for these to get reasonable numbers that align with the min, max and median of reported deaths.
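
A rough sketch of the calculation being attempted (the column names, and the age ranges other than 15-39, are assumptions rather than the exact dataframe in covid_data.py):

def deaths_by_age(df, ages=("15-39", "40-59", "60-")):
    for age in ages:
        # cumulative deaths implied by the reported cumulative CFR
        cum_deaths = df[f"W3 CFR {age}"] / 100 * df[f"Cases Cum {age}"]
        # daily deaths = day-on-day difference; this goes negative whenever the
        # reported CFR drops or the case series doesn't rise at the right time
        df[f"Deaths {age}"] = cum_deaths.diff()
    return df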

💡 Feature Request: Allow dataset collection errors to occur without fatal errors. I.e. change from Fail Early to Fail Partially.

Problem: Data sources change sometimes and currently exceptions are handled by the main python process, such that any exception will cause the run to exit with a fatal error. (Fail early)

The feature request is:

  • Decouple each dataset collection procedure, to report (and permit) failures in one dataset source without halting collection from the other sources. (Fail partial)

Proposed solution:

  1. Decouple Data Collection Functions (or Chains of Functions) (e.g. per output dataset/df or per dataset source/url):
    • i.e. ensure one chain of functions handles one thing/source/dataset.
    • i.e. Break apart main call graph such that each dataset collection is independent from other dataset collections.
    • Specifically, such that one dataset collection can fail independently, but not hold up the other dataset collections.
  2. Independent Execution, either by:
    • (i) add try-excepts into __main__, around each function call (i.e. the entrance into a call graph chain); see the sketch after the Expected Outcomes list.
    • (ii) Or, create separate processes for each chain of functions and let main continue to handle exceptions.

Expected Outcomes:

  • Changes to dataset sources that cause fatal errors no longer hold up collection from all other dataset sources.
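
A minimal sketch of option (i), wrapping each independent collection chain in __main__ so one failing source doesn't stop the others (apart from get_test_reports, the function names here are illustrative, not the actual call graph):

import logging

def collect_all():
    results = {}
    collectors = [
        ("tests", get_test_reports),                    # exists today
        ("briefings", get_briefings),                   # illustrative name
        ("situation_reports", get_situation_reports),   # illustrative name
    ]
    for name, collect in collectors:
        try:
            results[name] = collect()
        except Exception:
            # Fail partially: record the failure and keep collecting the other sources
            logging.exception("Collection of %s failed, continuing with other sources", name)
    return results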

Fix Vaccination plots

  • daily needs same order groups
  • daily needs to handle missing data from groups better
  • % progress missing data and wrong colours.
  • progress missing the new groups
  • gaps in province data #67
  • goes to zero sometimes. dashboard data? got negative daily also
  • unknown shows up as 3rd dose in daily plot. handle unknown properly. #68

fix gaps in vaccinations by province


Could be one of several problems

  • bad parsing on vaccination reports tables
  • missing data on vaccination report tables
  • previous mistakes in dashboard code have stored 0 or other bad data (this data is not refreshed each time)
  • is dashboard data even used for this graph?
  • if there is no data available, is it better to interpolate it? (see the sketch below)
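
If interpolation turns out to be the answer, a sketch of filling only the internal gaps per province (the exact column and index names are assumptions):

# only fills NaNs surrounded by real values, so leading/trailing gaps stay visible
df["Vac Given 1 Cum"] = df.groupby("Province")["Vac Given 1 Cum"].transform(
    lambda s: s.interpolate(limit_area="inside")
)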

add predicted vaccination rate to progress by groups plots

perhaps to the end of the year?

Should be able to get the rate for the last 2 weeks and plot a dotted line going forward. Maybe even a line to show where it hits 70% and 80%?

An enhanced version might modify the 2nd dose estimate by factoring in the time since first dose and the vaccines used?
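
A sketch of the simple projection described above, assuming a cumulative-dose series indexed by date (the names and horizon are illustrative):

import numpy as np
import pandas as pd

def project(cum, days_ahead=120):
    # average daily rate over the last 2 weeks of data
    rate = (cum.iloc[-1] - cum.iloc[-15]) / 14
    future = pd.date_range(cum.index[-1], periods=days_ahead + 1, freq="D")[1:]
    return pd.Series(cum.iloc[-1] + rate * np.arange(1, days_ahead + 1), index=future)

# plotted as a dotted line continuing the real series, e.g.
# project(df["Vac Given 1 Cum"]).plot(style=":")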

plot home isolation and hospital numbers

dashboard data has extra breakdown on field hospitals

                D_Hospitel="Hospitalized Field Hospitel",
                D_HICI="Hospitalized Field HICI",
                D_HFieldOth="Hospitalized Field Other",

Scrape in parallel

None of the scrapers rely on each other and they can be run in any order, so it should be easy to put them in a worker queue and use multiple processes to get through them faster on Actions.
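
A minimal sketch using a process pool (the scraper registry passed in is illustrative):

from concurrent.futures import ProcessPoolExecutor, as_completed

def scrape_all(scrapers):
    # scrapers: dict mapping a name to a zero-argument scraping function
    results = {}
    with ProcessPoolExecutor() as pool:
        futures = {pool.submit(fn): name for name, fn in scrapers.items()}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results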

Fix missing allocations by using % given.

According to the raw data published by MOPH, vaccine allocation data is separated by manufacturer (Sinovac and AstraZeneca). However, the scraped data shows Vac Allocated AstraZeneca 1 | 2 and Vac Allocated Sinovac 1 | 2. What is the difference between the two?

Also, vaccine allocation data for every province is missing after 9th May 2021.

Thanks.

EDIT

Fixes

  • Improve documentation in the readme about dose 1|2. Maybe better names?
  • Add in an Allocated 1|2 value calculated from the given % (or using allocated where available?); see the sketch below.
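
A sketch of the back-calculation, assuming the reported % is doses given as a share of doses allocated (the column names follow the existing export but are not confirmed):

# will be inf/NaN wherever the reported % is zero or missing
df["Vac Allocated 1"] = df["Vac Given 1 Cum"] / (df["Vac Given 1 %"] / 100)
df["Vac Allocated 2"] = df["Vac Given 2 Cum"] / (df["Vac Given 2 %"] / 100)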

Vac Given by brands is gone?

Hi Dylan, I noticed that the "Vac Given" by-brand columns are gone from the vac_timeline.csv file. Were they moved to another file or is this a bug? If it is a bug, I could try and help fix it.

Nat

feature: plot trending clusters

  • briefing has clusters listed. Not sure if it's all of them or some.
  • Might be interesting to show the top 5 clusters that are growing
  • Unique id would be date + cluster name? (maybe province/s?)
  • Cluster name would probably have to be in Thai (but at least the province would be translated)

[screenshot: cluster listing from a briefing, 2021-07-16]

Add right side axis labels

The most interesting data are on the right side of the chart. Having labels only on the left side makes it somewhat hard to read.

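A minimal fix, assuming the plotting code has access to the matplotlib Axes for each chart:

# mirror the y tick labels onto the right-hand side as well
ax.tick_params(axis="y", which="both", labelright=True, right=True)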

Dashboard province data incorrect sometimes

Seems to happen when it needs to get more than one date in one session.
For example, 2021-09-01 is actually 85 not 35, but 156 is correct.
https://github.com/djay/covidthailand/runs/3492265169?check_suite_focus=true

2021-09-02 00:00:00 MOPH Dashboard Retry Missing data at (datetime.datetime(2021, 9, 2, 0, 0), 'Buriram') for ['Vac Given 1 Cum', 'Vac Given 2 Cum', 'Vac Given 3 Cum']. Retry
2021-09-02 MOPH Dashboard 156.0 0.0 0.0 0.0 156.0 0.0 NaN NaN NaN NaN Buriram
...
2021-09-01 00:00:00 MOPH Dashboard Retry Missing data at (datetime.datetime(2021, 9, 1, 0, 0), 'Buriram') for ['Vac Given 1 Cum', 'Vac Given 2 Cum', 'Vac Given 3 Cum']. Retry
2021-09-01 MOPH Dashboard 34.0 0.0 0.0 0.0 34.0 1.0 NaN NaN NaN NaN Buriram

The existing code tries to reset the connection if there is a failure; somehow it must be going wrong.
https://github.com/djay/covidthailand/blob/main/utils_scraping.py#L694

Solution Proposal for Vac Report Table

The order of columns remains the same. We can just keep the rows that contain numbers and add the column names afterwards.

import os

import camelot
import pandas as pd
import requests


def parse_raw(url):
    # Download the daily vaccination report and cache it locally
    os.makedirs("tmp", exist_ok=True)
    response = requests.get(url)
    with open("tmp/daily_report.pdf", "wb") as file:
        file.write(response.content)
    # The vaccination tables are on pages 2 and 3
    tables = camelot.read_pdf("tmp/daily_report.pdf", pages="2,3", split_text=True)
    raw_table = pd.DataFrame()
    for i in range(2):
        df = tables[i].df
        # Keep only data rows, i.e. rows whose second column is a number
        df = df[df[1].str.isdigit()]
        df.drop([2], axis=1, inplace=True)
        raw_table = pd.concat([raw_table, df], ignore_index=True)
    # Cells can contain several newline-separated values; split them out
    table_dict = raw_table.transpose().to_dict()
    rows = []
    for row_num in table_dict:
        cleaned_row = []
        for key, value in table_dict[row_num].items():
            for col in value.replace(" ", "").split("\n"):
                if col:
                    cleaned_row.append(col)
        rows.append(cleaned_row)
    return pd.DataFrame(rows)


df = parse_raw("https://ddc.moph.go.th/vaccine-covid19/getFiles/9/1628485849393.pdf")
test = df.iloc[:, 0:12]
test.columns = ["Health Area", "Population", "Vac Allocated AstraZeneca", "Vac Allocated Sinovac",
                "Vac Allocated Pfizer", "Vac Allocated Total", "Vac Given 1 Cum", "Vac Given 1 %",
                "Vac Given 2 Cum", "Vac Given 2 %", "Vac Given 3 Cum", "Vac Given 3 %"]
print(test)

incorrect 3rd dose on daily vaccines when vaccine report delayed

https://github.com/djay/covidthailand/wiki/vac_groups_daily_30d.png

The reason is we have data for 1st, 2nd, 3rd dose earlier than the breakdowns per group.
Dashboard and briefings give us the dose totals. Vac reports give us the groups.

So need to show this data properly. Probably with a 1st dose / 2nd dose category which is only shown on days where the other data is missing, similar to the unknown feature of the plot_area function, but in this case we have 3 types of data we can show when groups aren't available (see the sketch below).
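
A sketch of the masking, assuming the per-group columns share a "Vac Group" prefix (the column names are illustrative, not the real dataframe):

# show the plain dose totals only on days where no per-group breakdown exists yet
group_cols = [c for c in df.columns if c.startswith("Vac Group")]
no_groups = df[group_cols].isna().all(axis=1)
for dose in (1, 2, 3):
    df[f"Vac Given {dose} (no group data)"] = df[f"Vac Given {dose}"].where(no_groups)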

Improved Infections estimate

The current infections estimate uses the ages of the population to infer the chance of a given person dying of covid. However, if this were accurate then the predicted median age of death would be ~80 instead of 65-70 (a sketch of the underlying calculation follows the list of ideas below).

Ideas

  • get better estimate of ages of those that died
    • take ages from cases - 11 days (or current avg days till death from confirmation).
    • get more data on ages of deaths - #34
    • some combination?
    • these should raise the estimate, since both should give lower ages than the general population?
  • factor in comorbidities.
    • Said to be more common in Thailand than the global average. True?
    • if true, how to modify the IFR to factor this in?
    • if true will lower the infections estimate
  • Use proper distribution for IFR age ranges
  • estimates of untested deaths? (estimated as 1-1.2x)
    • Use excess deaths corrected for expected changes in suicide and road deaths?
    • Including untested deaths estimate would raise the infections estimate
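
For reference, the basic relationship behind the estimate as a sketch; the IFR values, the death-age split and the 11 day lag below are illustrative assumptions, not the repo's actual numbers:

# assumed IFR per age band and assumed share of deaths falling in each band
ifr_by_age = {"0-29": 0.0001, "30-59": 0.002, "60+": 0.04}
death_share_by_age = {"0-29": 0.02, "30-59": 0.28, "60+": 0.70}

# infections = sum over ages of (deaths in that band / IFR for that band),
# which collapses to deaths divided by an "effective" IFR
effective_ifr = 1 / sum(share / ifr_by_age[age] for age, share in death_share_by_age.items())

# shift deaths back by an assumed ~11 days from confirmation to death
infections_est = df["Deaths"].shift(-11) / effective_ifr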

Delete viz-updates branch

@djay The viz-updates branch was originally intended for visualisation updates, but these have been covered through other branches already. The viz-updates branch also contained miscellaneous updates that have nothing to do with how data is being presented (for example code cleanup, which I plan to implement at a later stage).

I suggest deleting the viz-updates branch as I am not planning on merging it or working on it further.

plot severe cases per age group

This data is currently collected from the dashboard scraping but not displayed anywhere.
Could help show the effect of vaccinations, as it should show severe cases for the elderly trending down and severe cases for younger people trending up in a % plot.

add back in the allocated vaccines

Previously this was on the cumulative vaccine plot and also on vac_daily (as a runway amount, i.e. how many per day to run out in 1 week).

hover to get detail on numbers (interactive js plot)

The idea would be to have the ability to view details for every day of every line, similar to OWID. Could also show actual and average values this way.


Consider the time to reimplement all the customisations in the plot function vs how nice the interactivity or plotting is.

Proposed solution

  • SVG with embedded JS
    • prototype in #217 based on the Tooltip example, but instead set a single mouseover on the main plot rectangle, then work out the date and update the tooltip based on it
    • Pro: plots look exactly the same with no extra effort. Only have to get this custom JS to work.
    • Con: lose ability to copy and paste plots (but might be able to add in a button to do this?)
    • Con: we can't have the JS work and copy-and-paste of the image at the same time, but with some JS we could maybe add a right click or a button to do the same thing later if we want.

Alternative solutions

  • MPLD3

  • Redo plot_area with plotly

    • or Bokeh or others
    • Would have to start again with laying everything out nicely but tools for doing that might be nicer
    • Could still save as png on server so users can have export?
    • Seems like pandas plot has direct support for plotly, but not sure we can use that (see the sketch after this list)
    • e.g. js based like plotly
  • mplh5canvas?
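
A quick way to try the pandas/plotly route, assuming plotly is installed (pandas >= 1.0 supports switching the plotting backend); the output path here is illustrative:

import pandas as pd

pd.options.plotting.backend = "plotly"
fig = df.plot(title="Cases by source")   # returns a plotly Figure with hover tooltips built in
fig.write_html("outputs/cases.html", include_plotlyjs="cdn")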

Feature Request: Increase legend and title font sizes via rc.params

Problem:

Desktop viewers on https://djay.github.io/covidthailand/ can zoom in or click to expand image sizes.
Desktop viewers on https://github.com/djay/covidthailand/ cannot zoom in, as GitHub CSS maintains the image (/column) width.

This can make reading legends and title headings more difficult.

Recommended Solution:

The matplotlib rc.params default settings are set around this line:

plt.rcParams.update({'font.size': 16})

Add / update rc.params for these items until those plots can easily be read at a quick glance (rc-params-list); see the example after the list:

  • legend.fontsize
  • axes.titlesize
  • figure.titlesize
  • figure.titleweight
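
For example (the sizes here are guesses to be tuned, not final values):

import matplotlib.pyplot as plt

plt.rcParams.update({
    'legend.fontsize': 18,
    'axes.titlesize': 22,
    'figure.titlesize': 24,
    'figure.titleweight': 'bold',
})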

Then:

  • Execute plot generating code.
  • Check plot headings / legends can be read more easily, yet still maintain good aesthetic appearance.

Expected Outcomes:

  • No warnings or error while running plot generating code.
  • Plot Headings and Plot Legends are easily read "at a glance", while maintaining a good aesthetic appearance.

find source for "Adverse Reactions After Vaccination"

used to appear in each vaccination report


A total of 18 cases of severe adverse events following vaccination have been confirmed by the expert panel*: 15 with a severe allergic reaction (anaphylaxis) and 3 with numbness (polyneuropathy). All recovered after treatment, and no deaths have been found to be caused by the vaccine.

Period | Number of severe adverse events confirmed by the expert panel*
-- | --
Cumulative since 28 February 2021 | 18
Severe allergic reaction (anaphylaxis) | 15
Numbness (polyneuropathy) | 3

no data source for vaccinations details

the following is no longer in the report

  • allocations
  • per province risk groups or totals

All we have now is the total + % for each shot for the top 10 provinces vaccinated.

Options

  • find another source
  • show by province but with a "rest of Thailand" section; remove other plots/exports

linked list of sources (csv and html)

To make it easy for people to get access to the source data if they want.

Listed by date.

  • Briefings
  • situation reports
  • testing
  • vaccination
  • by area/province

Single python file too big

Needs to be broken up but into how many files?

The code roughly breaks down like this

  • utils - one file or many?
    • scraping - getting and converting to text. web_links, web_files, dav_files, parse_file
    • converting pptx charts
    • twitter scraping utils
    • some generic abstract utils like split and pairwise (mainly used to parse tables from text)
    • some pandas utils like fuzzy_join, trendline, topprov, export, import, human_format
    • date scraping utils (mainly thai dates)
    • province matching
      • move misspellings to csv file
      • area_crosstab
  • thai covid scraping
    • situation reports
      • split thai scraping into smaller functions
    • RB tweets
      • split get_cases_by_prov_tweets
    • briefing reports
      • split briefing_deaths (2-3 different types of tables scraped)
    • testing data
    • vaccination reports
    • covid19daily
      • risk matching. put into CSV
    • ability to run scrape without plot, and pick what to scrape from command line/env?
  • plotting/analysis
    • could get split up by data used, e.g. vaccines vs cases
    • run plot without scrape (similar to use_cache_data now)

Not sure yet if there is one utils file or many?

  • scraping.py
  • thaidates.py
  • thaiprov.py
  • pdutils.py
  • collect.py
  • plot.py

Dataset request

Since the official API for cases, deaths and hospitalization data (https://covid19.th-stat.com/en/api) is unreliable: it's not regularly maintained and can't be accessed from outside of Thailand (yep, what the hell), which prevents server-side requests; can you please add total deaths, cases and hospitalization numbers to the Daily Situation Reports?

cannot run code locally because province names rely on a data source that is no longer available

The cases data from https://covid19.th-stat.com/api/open/cases, which is used to build the provinces data, is currently down and causes an error while scraping.

_, cases = next(web_files("https://covid19.th-stat.com/api/open/cases", dir="json", check=False))

EDIT

proposed fix

  1. export all the current mapping to csv and commit it. Should be just the Province->ProvinceEn mapping.
  2. replace the current sources with a single import of a local province_mapping.csv (see the sketch below)
  3. remove old code (or move to unused function)
  4. should health districts also be kept locally?
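
A sketch of steps 1-2 (the csv name is an assumption, and `provinces` stands for the dataframe currently built from the API):

import pandas as pd

# one-off: dump the mapping built from the old API and commit the csv
provinces[["Province", "ProvinceEn"]].drop_duplicates().to_csv("province_mapping.csv", index=False)

# future runs: load the committed mapping instead of hitting the dead endpoint
province_mapping = pd.read_csv("province_mapping.csv")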

Question: would it be possible to get data on vaccine administered by brands?

From these MOPH daily slides, on the 2nd page: https://ddc.moph.go.th/uploads/ckeditor2//files/Slide%202021-07-30.pdf

Example data

{
    "date":"2021-07-04",
    "Sinovac_daily": 20170,
    "AstraZeneca_daily": 61284,
    "Sinopharm_daily": 17151,
    "AstraZeneca_total": 4061982,
    "Sinopharm_total": 95076,
    "Sinovac_total": 6513839
  },

The Researcher has already collected this type of data, but it's incomplete, with some dates missing some data.
https://github.com/porames/the-researcher-covid-bot/blob/master/components/gis/data/vaccine-manufacturer-timeseries.json
