GithubHelp home page GithubHelp logo

nadiafank / chart-llm-vega-lite-charts Goto Github PK

View Code? Open in Web Editor NEW

This project forked from hyungkwonko/chart-llm

0.0 0.0 0.0 119.92 MB

Vega-Lite Chart Dataset and NL Generation Pipeline using LLMs

Home Page: https://hyungkwonko.info/chart-llm/

License: MIT License

JavaScript 12.34% Python 86.66% CSS 0.11% HTML 0.89%

chart-llm-vega-lite-charts's Introduction

NL generation framework using Vega-Lite specifications and LLMs

img

We introduce an LLM framework for generating various NL datasets from Vega-Lite specifications. Our framework comprises of two novel prompting techniques to synthesize relevant chart semantics accurately and to enhance syntactic diversity, respectively. First, we leverage guided discovery so that LLMs can steer themselves to create varying NL datasets in a self-directed manner. In detail, the framework analyzes and integrates chart semantics (e.g., mark, encoding) with our scaffolding in accordance with the characteristics of each NL dataset. Also, by answering on key questions, it autonomously concentrates on the chart’s key features or propose high-level decisions. Second, we utilize score-based paraphrasing to increase the syntactic diversity of the generated NL datasets. We define four language axes (clarity, expertise, formality, subjectivity) using automatic qualitative coding, and the framework augments the generated datasets quantitatively along these axes.

img

We also present a new collection of 1,981 Vega-Lite specifications, which is used to demonstrate the generalizability and viability of our NL generation framework. This collection is the largest set of human-generated charts obtained from GitHub to date. It covers varying levels of complexity from a simple line chart without any interaction (i.e., simple) to a chart with four plots where data points are linked with selection interactions (i.e., extra complex). As we focus on collecting complex charts, more than 86% of them are in complex and extra complex levels. Compared to the benchmarks, our dataset shows the highest average pairwise edit distance between specifications, which proves that the charts are highly diverse from one another. Moreover, it contains the largest number of charts with composite views, interactions (e.g., tooltips, panning & zooming, and linking), and diverse chart types (e.g., map, grid & matrix, diagram, etc.). Also refer to our website to see the charts. The metdata for charts including the licenses for each chart is presented here.

Loading the dataset via Huggingface

Please refer to this code:

import json
from datasets import load_dataset

dataset = load_dataset("hyungkwonko/chart-llm", data_files="data.txt")
json_data = [json.loads(data) for data in dataset["train"]["text"]]
print(f"len(json_data): {len(json_data)}")

Example NL Dataset Generation

Please prepare Open-AI API KEY and locate the .env file in the root:

OPENAI_API_KEY=your-openai-api-key

To run the example codes, refer to below:

# L1 Caption Generation (Caption)
python -m framework_sample.task1_L1

# L2 Caption Generation (Caption)
python -m framework_sample.task1_L2

# Utterance (NL to Chart Generation)
python -m framework_sample.task2

# Question (Chart Question Answering)
python -m framework_sample.task3

NL Dataset Paraphrasing

img

Our paraphrasing technique is inspired by a linear interpolation in the latent space for image generation and manipulation as demonstrated in many system and application papers. It enables a smooth transition from one expression to another by focusing on creating controllable and meaningful variations of a single sentence. We employ a five-point Likert-scale and assign one of the axes to each. Here, we focus on altering only the sentence's syntax, while maintaining its meaning. In detail, we provide LLMs with a sentence (i.e., Example Sentence) and an explanation about one of the defined axes (i.e., Axis) and its two directions (i.e., Direction-1, Direction-2). We assign a specific value on a Likert scale ranging from one to five, to paraphrase the sentence as if it were spoken by a person using a language with a certain degree indicated by the score.

Examples NL Dataset Paraphrasing

# Paraphrasing with Random one/two Axes
python -m framework_sample.task4_1

# Paraphrasing with one Axis and print them all (5)
python -m framework_sample.task4_all1

# Paraphrasing with two Axes and print them all (5*5=25)
python -m framework_sample.task4_all2

Citation

You can use the following bibtex to cite our work:

@misc{ko2023natural,
  title={Natural Language Dataset Generation Framework for Visualizations Powered by Large Language Models}, 
  author={Hyung-Kwon Ko and Hyeon Jeon and Gwanmo Park and Dae Hyun Kim and Nam Wook Kim and Juho Kim and Jinwook Seo},
  year={2023},
  eprint={2309.10245},
  archivePrefix={arXiv},
  primaryClass={cs.HC}
}

chart-llm-vega-lite-charts's People

Contributors

hyungkwonko avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.