- Get PDF
- Download online
- Transform PDF to XML
- use FoxitPDF convert to XML
- parse XML to csv
- run notebooks/data_processing/xml_parse.ipynb
- Inference or Finetune
- Inference (use esgbert directly)
- Run notebooks/model_training/esgbert_inference.ipynb
- Finetune & Inference (use gpt-3.5 finetune and inference)
- Run notebooks/data_processing/openai.ipynb
- Run notebooks/data_processing/multi_gpt_label_clean.ipynb
- Run notebooks/model_training/esgbert_finetune.ipynb
- Inference (use esgbert directly)
- Visualization (Radar chart)
- Run notebooks/visualization/main_radar.ipynb
pip install -r requirements.txt
- Python 3.10
- CUDA 12.1
- Change hyper parameter here
# american or taiwan
NATION = 'american'
# xml source path
XML_PATH = '../../data/xml'
# csv source path
CSV_PATH = '../../data/csv_source/'
# min sentence length below will be ignore (to maintain the quality of sentence)
MIN_SENTENCE_LEN = 10
- Run the cell
xml_parse = XmlParse(NATION, XML_PATH, CSV_PATH, MIN_SENTENCE_LEN)
xml_parse.parse_xml(file_list)
- Open notebooks/model_training/esgbert_inference.ipynb
# csv path after xml_parse
CSV_SOURCE = '../../data/csv_source'
# csv path inference output
CSV_OUTPUT = '../../data/csv_output'
# american or taiwan you want to put, or can keep empty
NATION = ''
# the csv file you want to inference
FILE = '4_apple_1_72'
- Run the cell
esgbert = EsgBertPredict(CSV_SOURCE, CSV_OUTPUT, NATION, FILE, HYPER_PARAMETERS)
esgbert.main()
- copy notebooks/data_processing/.env.example to notebooks/data_processing/.env
- paste your openai_api_key in .env
OPENAI_API_KEY=Your openai key
# american or taiwan
NATION = 'american'
# csv source path
CSV_SOURCE_PATH = '../../data/csv_source'
# csv output path
CSV_OUTPUT_PATH = '../../data/csv_gpt_label'
# max token length (because we desire only output number label like [3, 4, 5] so 10 token is enough, and can save your money if accident happen)
MAX_TOKENS = 10
- Run the cell
openai_label = OpenAILabel(file_list, NATION, CSV_SOURCE_PATH, CSV_OUTPUT_PATH, MAX_TOKENS)
openai_label.loop_folder()
NATION = 'american'
# the csv after gpt label
SOURCE_PATH = '../../data/csv_gpt_label/'
# where you want to output the clean and concate csv
OUTPUT_PATH = '../../data/csv_gpt_label/'
- Run the cell
clean_multi_gpt_label = CleanMutliGptLabel(file_list, NATION, SOURCE_PATH, OUTPUT_PATH)
clean_multi_gpt_label.loop_folder()
- Open notebooks/model_training/esgbert_finetune.ipynb
- Change hyper parameter here
HYPER_PARAMETERS = {
'batch_size': 8,
'lr': 1e-5,
'epoch': 1,
# multilabel output threshold
'threshold': 0.02
}
- change directory as you want
# csv path after gpt mark label
TRAIN_CSV_PATH = '../../data/csv_gpt_label'
# csv path you want to predict
TEST_CSV_PATH = '../../data/csv_source'
# csv path after finetune esgbert predict
TEST_CSV_OUTPUT_PATH = '../../data/csv_output'
# american or taiwan or ''
NATION = 'american'
# train csv file name
TRAIN_SOURCE = 'gpt_small'
# test csv file name
TEST_SOURCE = '4_apple_small'
#
TEST_OUTPUT = 'finetune_output'
- Run the cell
esgbert = EsgBertPredict(TRAIN_CSV_PATH, TEST_CSV_PATH, TEST_CSV_OUTPUT_PATH, NATION, TRAIN_SOURCE, TEST_SOURCE, TEST_OUTPUT, HYPER_PARAMETERS)
esgbert.main()
- Open notebooks/visualization/main_radar.ipynb
# csv path contain american and taiwan csv
CSV_PATH = f'../../data/csv_output'
# output png folder
PNG_PATH = f'../../data/png'
# which nation you are drawed
NATION = 'american'
draw = MultiRadar(NATION, PNG_PATH, CSV_PATH, label_cnt=27)
- Choose which you want to draw
- draw_average(csv_path): draw the average radar chart for the csv folder
- draw_full25(csv_path): draw all the company radar chart int the csv folder
- draw_company_average(csv_path): draw all company vs average in the csv folder
- taiwan_vs_american(): draw taiwan average vs american average
# choose picture to draw
draw.draw_average(draw.AMERICAN_CSV_PATH)
# draw.draw_full25(draw.AMERICAN_CSV_PATH)
# draw.draw_company_average(draw.AMERICAN_CSV_PATH)
# draw.taiwan_vs_american()