StarCoder: computational intelligence for humanities scholarship

StarCoder is a machine learning framework designed for researchers in fields such as History, Literary Criticism, or Classics who are interested in what cutting-edge neural models can reveal about their objects of study. It accomplishes this by following several principles:

  1. Focus on data: the humanist need only worry about how they represent their material, which is a critical aspect of empirical studies, computational or otherwise. By using JSON-LD to describe entities and their relationships, StarCoder encourages humanists to produce completely general, transparent, and explicit archives.
  2. Minimal dependencies: StarCoder is built from the ground up in PyTorch, its only dependency other than a recent version of Python.
  3. Unsupervised
  4. Flexible training
  5. Interpretable output

At the same time, StarCoder is also designed for machine learning researchers.

StarCoder's goal is to programmatically generate, train, and employ neural models tailored to complex data sets, allowing experts in other fields to remain focused on their particular domain while benefiting from advancements in machine learning. StarCoder models can be used for supervised and unsupervised tasks such as classification, augmentation, cleaning, clustering, and anomaly detection. It assumes a typed entity-relationship model specified with human-readable JSON conventions. StarCoder combines graph-convolutional networks, autoencoders, and an open set of encoder-decoder pairs. In brief, the procedure is:

  1. For each field, create an encoder that translates its type into a fixed-length representation
  2. For each entity-type, create an initial autoencoder for the concatenated output of its fields' representations
  3. Stack additional autoencoder layers that include bottlenecks from adjacent entities in the previous layer
  4. Again for each field, create a decoder that translates the autoencoder output back into the field's type
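As a rough illustration of how the four steps fit together, the sketch below only plans the autoencoder input sizes; it builds no model. The constants `FIELD_SIZE` and `BOTTLENECK_SIZE`, the function `plan_autoencoder_inputs`, and the wiring chosen for step 3 (an entity's own bottleneck plus one pooled bottleneck per relation) are assumptions made for this example, not StarCoder's actual implementation.

```python
# Hypothetical sizes, chosen only for illustration.
FIELD_SIZE = 8       # length of every field encoder's output (step 1)
BOTTLENECK_SIZE = 4  # width of every autoencoder bottleneck


def plan_autoencoder_inputs(entity_fields, relation_counts, depth):
    """Return {(entity_type, layer): autoencoder input size}.

    entity_fields:   {entity_type: [names of its non-relation fields]}
    relation_counts: {entity_type: number of relations touching that type}
    depth:           total number of stacked autoencoder layers (>= 1)
    """
    plan = {}
    for entity_type, fields in entity_fields.items():
        # Step 2: the first autoencoder sees the concatenated field encodings.
        plan[(entity_type, 0)] = FIELD_SIZE * len(fields)
    for layer in range(1, depth):
        for entity_type in entity_fields:
            # Step 3 (assumed wiring): the entity's own bottleneck from the
            # previous layer plus one pooled bottleneck per relation.
            neighbours = relation_counts.get(entity_type, 0)
            plan[(entity_type, layer)] = BOTTLENECK_SIZE * (1 + neighbours)
    return plan


# Field names follow the email example used in this README.
entity_fields = {"person": ["name", "role", "age"], "email": ["content", "date"]}
relation_counts = {"person": 2, "email": 2}  # sent_by and received_by
plan = plan_autoencoder_inputs(entity_fields, relation_counts, depth=2)
for key in sorted(plan):
    print(key, plan[key])
```

Step 4 mirrors step 1: each field would get a decoder that maps the final autoencoder output back to the field's own type, so the reconstruction can be compared with the input.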

In the simplest case, the model is trained on end-to-end reconstruction loss, and can be used as a similarity measure (e.g. cosine between bottleneck representations), for anomaly detection (e.g. entropy of softmax outputs), or to explore relationships interactively (e.g. by editing fields and observing the effect). By selectively masking fields, it can also be trained as a classifier, and a variety of dropout and perturbations can be applied to fit different needs.
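The measures mentioned above are standard and can be sketched without any dependencies. The helper names (`cosine_similarity`, `softmax`, `entropy`, `mask_fields`) and the toy vectors are assumptions for this example; in practice the inputs would be bottleneck vectors and decoder outputs taken from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a less certain prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mask_fields(entity, hidden):
    """Copy of `entity` without the fields in `hidden` (e.g. a label to predict)."""
    return {name: value for name, value in entity.items() if name not in hidden}


# Toy bottleneck vectors for two entities.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
# A flat categorical output is more "surprising" than a peaked one.
print(entropy(softmax([0.0, 0.0, 0.0])) > entropy(softmax([10.0, 0.0, 0.0])))
# Hide the "role" field so that a model would have to reconstruct it.
print(mask_fields({"name": "Chris", "role": "supervisor"}, {"role"}))
```

Masking turns reconstruction into classification: the hidden field is withheld from the encoder side but still scored on the decoder side.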

See the tutorial for thorough examples of using StarCoder. As a quick intro to what is required from the domain experts, consider someone studying an email archive. Two files are required, both in JSON format: a "schema", and a list of entities. For the email example, the schema might be:

{
  "entity_type_field" : {"type" : "entity_type"},
  "id_field" : {"type" : "id"},
  "content" : {"type" : "text"},
  "date" : {"type" : "date"},
  "name" : {"type" : "text"},
  "role" : {"type" : "categorical"},
  "age" : {"type" : "numeric"},
  "sent_by" : {"type" : "relation",
               "source_entity_type" : "email",
               "target_entity_type" : "person"},
  "received_by" : {"type" : "relation",
                   "source_entity_type" : "email",
                   "target_entity_type" : "person"}
}

while the list of entities might be:

[
  {"id_field" : "1", "entity_type_field" : "person", "age" : 30, "name" : "Chris", "role" : "supervisor"},
  {"id_field" : "2", "entity_type_field" : "person", "age" : 20, "name" : "Lynn", "role" : "employee"},
  {"id_field" : "3", "entity_type_field" : "email", "sent_by" : "1", "received_by" : "2", "date" : "23-3-2020", "content" : "Get back to work!"},
  ...
]
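To make the relationship between the two files concrete, here is a minimal sketch that loads a schema and an entity list, indexes the entities by identifier, and checks that every relation points at an entity of the declared type. It embeds a cleaned-up copy of the email example in which the identifier is stored under the schema's id field. The helpers `special_field`, `index_entities`, and `resolve_relations` are hypothetical names for this example and not part of StarCoder's API, and the sketch assumes each relation field holds a single identifier.

```python
import json

SCHEMA = json.loads("""
{
  "entity_type_field": {"type": "entity_type"},
  "id_field": {"type": "id"},
  "content": {"type": "text"},
  "date": {"type": "date"},
  "name": {"type": "text"},
  "role": {"type": "categorical"},
  "age": {"type": "numeric"},
  "sent_by": {"type": "relation", "source_entity_type": "email", "target_entity_type": "person"},
  "received_by": {"type": "relation", "source_entity_type": "email", "target_entity_type": "person"}
}
""")

ENTITIES = json.loads("""
[
  {"id_field": "1", "entity_type_field": "person", "age": 30, "name": "Chris", "role": "supervisor"},
  {"id_field": "2", "entity_type_field": "person", "age": 20, "name": "Lynn", "role": "employee"},
  {"id_field": "3", "entity_type_field": "email", "sent_by": "1", "received_by": "2",
   "date": "23-3-2020", "content": "Get back to work!"}
]
""")


def special_field(schema, type_name):
    """Name of the single schema field whose type is `type_name`."""
    names = [name for name, spec in schema.items() if spec["type"] == type_name]
    if len(names) != 1:
        raise ValueError(f"expected exactly one field of type {type_name!r}")
    return names[0]


def index_entities(schema, entities):
    """Map identifier -> entity, rejecting fields the schema does not declare."""
    id_key = special_field(schema, "id")
    by_id = {}
    for entity in entities:
        unknown = set(entity) - set(schema)
        if unknown:
            raise ValueError(f"fields not in schema: {sorted(unknown)}")
        by_id[entity[id_key]] = entity
    return by_id


def resolve_relations(schema, by_id):
    """Return (source_id, relation, target_id) triples with type-checked endpoints."""
    id_key = special_field(schema, "id")
    type_key = special_field(schema, "entity_type")
    edges = []
    for entity in by_id.values():
        for name, spec in schema.items():
            if spec["type"] != "relation" or name not in entity:
                continue
            target = by_id[entity[name]]
            if (entity[type_key] != spec["source_entity_type"]
                    or target[type_key] != spec["target_entity_type"]):
                raise ValueError(f"relation {name!r} links the wrong entity types")
            edges.append((entity[id_key], name, entity[name]))
    return edges


by_id = index_entities(SCHEMA, ENTITIES)
edges = resolve_relations(SCHEMA, by_id)
print(edges)
```

The resulting triples are the edges of the entity graph: the email is the source of both relations, and the two people are the targets.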

StarCoder uses this format directly, but also includes adapters for common formats, particularly tabular (CSV) data.
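The sketch below shows what a tabular adapter might do, under the assumption that each CSV file describes a single entity type and each row becomes one entity. `rows_to_entities` and `parse_number` are hypothetical helpers written for this example; StarCoder's own adapters may work differently.

```python
import csv
import io

# Only the schema fields needed for the "person" rows of the email example.
SCHEMA = {
    "entity_type_field": {"type": "entity_type"},
    "id_field": {"type": "id"},
    "name": {"type": "text"},
    "role": {"type": "categorical"},
    "age": {"type": "numeric"},
}

PEOPLE_CSV = "id,name,age,role\n1,Chris,30,supervisor\n2,Lynn,20,employee\n"


def parse_number(text):
    """Parse an integer if possible, otherwise a float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def rows_to_entities(csv_text, schema, entity_type, id_column):
    """Convert CSV rows into entity dictionaries that follow `schema`."""
    type_key = next(n for n, s in schema.items() if s["type"] == "entity_type")
    id_key = next(n for n, s in schema.items() if s["type"] == "id")
    entities = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        entity = {type_key: entity_type, id_key: row[id_column]}
        for column, value in row.items():
            if column == id_column or not value:
                continue  # the id is already stored; empty cells are omitted
            if column not in schema:
                raise ValueError(f"column {column!r} is not a schema field")
            if schema[column]["type"] == "numeric":
                value = parse_number(value)
            entity[column] = value
        entities.append(entity)
    return entities


people = rows_to_entities(PEOPLE_CSV, SCHEMA, "person", id_column="id")
for person in people:
    print(person)
```

Leaving empty cells out, rather than storing a placeholder, keeps missing values distinguishable from real ones in the resulting entities.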

Representational conventions

In all cases, tensor dimensions are ordered by increasing specificity. This is admittedly somewhat vague and open to interpretation, but a few concrete examples help:

  1. The first dimension always indexes the batch
  2. If the field type is grid-like (sequence, image, etc.), dimensions that indicate the location always immediately follow the batch: for example, a sequence of word-embeddings will have shape (BATCH_SIZE x SEQUENCE_LENGTH x EMBEDDING_SIZE), an RGB image will have shape (BATCH_SIZE x WIDTH x HEIGHT x 3), etc.

As the image example shows, somewhat-arbitrary choices are needed (HEIGHT could have preceded WIDTH), and these will be documented as they arise and new field types are defined. It is very likely that StarCoder will switch to named tensor dimensions and this will become moot.
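The ordering convention can be illustrated with plain nested lists standing in for tensors, so the snippet needs no dependencies. The `shape` helper and the concrete sizes are assumptions for this example only.

```python
def shape(nested):
    """Shape of a regular nested list, outermost dimension first."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        if not nested:
            break
        nested = nested[0]
    return tuple(dims)


# Illustrative sizes only.
BATCH_SIZE, SEQUENCE_LENGTH, EMBEDDING_SIZE = 2, 5, 4
WIDTH, HEIGHT = 6, 3

# Batch first, then the location (position in the sequence), then the features.
word_embeddings = [
    [[0.0] * EMBEDDING_SIZE for _ in range(SEQUENCE_LENGTH)]
    for _ in range(BATCH_SIZE)
]

# Batch first, then the location (width, height), then the three colour channels.
rgb_images = [
    [[[0] * 3 for _ in range(HEIGHT)] for _ in range(WIDTH)]
    for _ in range(BATCH_SIZE)
]

print(shape(word_embeddings))
print(shape(rgb_images))
```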

Next steps

See the todo list for work that is planned or underway.

Contributors

tomlippincott
