18f / data-federation-project Goto Github PK

A project focused on tools and best practices to supported federated data collection efforts

10x-data-federation data-federation ingest-data opendata

data-federation-project's Introduction

Data Federation Project

The U.S. Data Federation project promotes government-wide capacity-building to support distributed data management challenges, data interoperability, and broader data standards activities. The project is an initiative of the GSA Technology Transformation Services (TTS) 10x program, which funds technology-focused ideas from federal employees with an aim to improve the experience all people have with our government.

Overview

U.S. government policies, initiatives, and public-facing products and services depend on aggregating and harmonizing data from disparate government sources. The goal of the U.S. Data Federation project is to document repeatable processes, develop reusable tooling, and curate resources to support federated data projects.

We define a federated data project as an effort in which a common type of data is collected or exchanged across complex, disparate organizational boundaries. For example, federal agencies often need to collect data from state and local governments, other federal agencies, and other data providers. These federated data may be used to support policy or budget decisions, operational efficiencies, or published in aggregate form for other data users.

Federated data efforts are increasingly seen as an engine for transparency, economic growth, and accountability, yet collecting this kind of data remains a challenge. While this type of data management effort is growing increasingly common in our distributed style of government, each new effort is still improvising solutions in terms of processes, tooling, and compliance infrastructure. Many of these federated data efforts face common requirements and common challenges, but lack common resources.

The U.S. Data Federation project was conceived in 2016 to address this gap. The project set out to identify common challenges and pain points in federated data efforts and address these needs by curating best practices and resources and developing reusable tooling. The best practices and resources were intended to include guides and repeatable processes around data governance, organizational coordination, and standards development in federated environments. The reusable tools were intended to include capabilities around data validation, automated aggregation, and the development and documentation of data specifications.

Over the course of its first three phases of 10x funding, it began to deliver on this ambition by building and launching ReVal, a Reusable Validation Library, which has been used by the USDA Food & Nutrition Services and other agencies to streamline data collection and validation processes.

During Phase 4, the team took advantage of a unique opportunity to unite government-wide efforts to support open data and federated data efforts. The team has supported Data.gov, OMB, and OGIS stakeholders in developing a vision and delivering increased functionality for resources.data.gov, a legislatively-mandated online repository of policies, tools, case studies and other resources to support data governance, management, and use throughout the federal government.

After conducting research with the stakeholders and audience for resources.data.gov, the team saw an opportunity for a long-term practical manifestation of the Data Federation as the content strategy team underpinning resources.data.gov. The future funding and organization of this work is currently under negotiation.

Project milestones

Phase 1 (Fall 2017)

Team: Phil Ashlock, Anthony Garvan

Interviewed a variety of distributed data management projects and synthesized findings in a Data Federation Framework.
Created a placeholder for future web content at federation.data.gov
Pitched for Phase 2 funding based on finding that reusable tooling and processes would benefit future federated data efforts

Phase 2 (Spring 2018)

Team: Phil Ashlock, Catherine Devlin, Anthony Garvan, Chris Goranson, Joe Krzystan

Prototyped a reusable data validation tool that allows users to submit data via a web interface or API to be validated against a set of customizable rules in real time
Partnered with the USDA Food & Nutrition Service (FNS) to adapt this tool for the FNS-742, a form that collects verification data for the National School Lunch Program
Pitched for Phase 3 funding to further develop the tool, implement it with FNS, and conduct outreach to identify other partners and other opportunities for reusable tools (Phase 2 Final Presentation)

Phase 3 (December 2018-June 2019)

Team: Phil Ashlock, Mike Gintz, Mark Headd, Ethan Heppner, Julia Lindpaintner, Amy Mok

Developed Phase 2 prototype into Reusable Validation Library (ReVal) with a focus on API-based usage
Worked with FNS to develop ReVal's first custom manifestation for FNS-742 as the FNS Data Validation Service
Validated demand for ReVal and identified future partners
Continued to identify common needs and useful reusable resources for data efforts through outreach and presentations to the Data Exchange Community of Practice, Interagency Working Group on Open Data, VA Open Data Working Group, and others
Began building a community around a shared need for knowledge-sharing across data efforts in government
Protect against redundancy by aligning the efforts of the U.S. Data Federation with other efforts across government, such as the work of the Federal Data Strategy and the mandates of the Evidence Act and Open Government Data Act
Pitched for Phase 4 funding to leverage the completion of ReVal and the momentum of the U.S. Data Federation work to support a long-term vision and strategic plan for a user-centered, maximally-effective resources.data.gov

Phase 4 (October 2019-April 2020)

Team: Phil Ashlock, Mike Gintz, Julia Lindpaintner, Amy Mok, Princess Ojiaku, James Tranovich

Collaborated with resources.data.gov stakeholders (GSA, OMB, OGIS) to identify likely audience and begin to define a long-term vision of success for the resource repository
Conducted interviews with over 30 people across 14 agencies, including 5 Chief Data Officers, data scientists, organizers of internal open data working groups, Federal Data Strategy detailees, and many others involved in their agency’s data governance or data management efforts
Outlined a long-term vision for the U.S. Data Federation as the content strategy team underpinning resources.data.gov and plans to prototype this approach during Phase 4
Reviewed all content and implemented new information architecture, navigation, and functionality in response to user research in order to make resources in the repository maximally discoverable
Prototyped the process of abstracting agency-specific resources to make them more broadly useful to other agencies

References and deliverables

Phase 1

Interview notes
Preliminary Findings
US Data Federation Framework (includes Data Federation Maturity Model and Data Federation Playbook)

Phase 2

Final presentation (PDF)
Prototype Django Data Ingest Tool

Phase 3

Project overview (PDF)
Project overview with FNS case study (PDF)
Project overview presentation (PDF)
Webinar via Digital.gov on April 17, 2019 // Slides (PDF)
18F blog post on U.S. Data Federation project
Federal Data Strategy Proof Point on the USDA FNS Data Validation Service
Phase 4 pitch presentation (PDF)

Phase 4
Forthcoming

Biweekly updates
Starting in Phase 3, the team began publishing updates on its activities and progress roughly bi-weekly. All past updates can be found here.

Related repositories

There are several repositories that contain code that is a part of this project.

Github repo for the US Data Federation website: https://github.com/GSA/us-data-federation
Github repo for ReVal, the US Data Federation's first reusable tool for data validation and aggregation: https://github.com/18F/ReVAL
Github repo for the FNS Data Validation Service, which uses ReVal to check submitted FNS-742 data against a set of centralized validation rules via API: https://github.com/18F/usda-fns-ingest
Github repo for the Workzone Data Exchange (WZDx) Validator, which uses ReVal to perform JSON Schema validation against the WZDx v1.1 specification: https://github.com/18F/usdot-jpo-ode-workzone-data-exchange

Other repos referenced:

data-federation-project's People

Contributors

Stargazers

Watchers

Forkers

lanebecker ominina mheadd openglobe juliaklindpaintner ethanheppner oluwamakinwa isabella232

data-federation-project's Issues

As a policy-maker, I want to understand where the project is going so that I can know how to leverage it.

User story

As a policy-maker, I want to understand where the project is going so that I can know how to leverage it.

Project communications related items:

Acceptance criteria

Due-diligence / outreach with other policy-makers is communicated
One-pager is shared (or overview deck)
Announced on the listserv / open data list / open cabinet list, etc.

Identify that first use case example!

User story

As a <type of user>, I want <a goal> so that <benefit>.

Acceptance criteria

Help identify who that first use case example might be (keeping it simple, etc.)
[ ]

Tools & Tech

Schema Definition:

JSON Schema (data.gov, code.gov)
Open API (Open311)
Data pack (csvs + JSON) - OpenReferral
JSON-LD (NIEM, beta only)
XML Schema (NIEM)
CSVs (DATA Act)
Described in policy / law (collecting data from municipalities in CT, restaurant inspections)

Schema Documentation:

Custom static webpage
readthedocs (OpenReferral)
interactive schema explorer (NIEM)
not-published (e.g., PDFs that get emailed around)
Policy / Law

Collection:

spreadsheets
quickbooks

Validation:

None / not addressed (OpenReferral, Open311)
Built into collection tool (DATA Act)
Standalone page (data.gov, code.gov)

Aggregation / Submission:

Upload to website (DATA Act)
Email to person (inspections, municipality collection)
Self-publish to known URL (data.gov, code.gov)

Interview: VIP + Open Civic Identifiers

Introductory comments

What is the data federation effort all about? What am I looking to get out of it?
This is a collaborative research project with GSA's Office of Products & Platforms and 18F. The goal is to build a toolkit / playbook for undertaking intra-governmental data collection / aggregation projects, such as data.gov, code.gov, and NIEM. Or goal is to find out what works, what doesn't, and what tools are appropriate for what circumstances, in order to accelerate similar efforts in the future.
Notes & Report will be public
Any questions before we get started?

What is VIP / Open Civic Identifiers, in your own words?

VIP = it's a standard that allows states to report ballot / voting information, ties it to a street / address point, also information about election officials that can help them with civic questions. OCId's - schemas and tools for gathering information on government officials / political geographies. Different ways of expressing the same thing- how to link ontologies to geographies. Those identifiers used to build out the civic information API. Most commonly in use for geographies- districts, commission districts, anything where you might want to link voters to an office and a person who represents them.

What was impetus or driving force for this effort: policy, user needs, etc (perhaps after the first question)

VIP => 2008, effort between Pew and Google. Google said people are going online to find this and can't. 8 states + LA joined, now 48 states + DC. Last election 48 states publish . OCIds- google / sunlight labs / some others, realized that they all had data around identifiers but couldn't be joined. Little bit of a lull, now getting a new governance model.

Note- not just a linker, also ontological / standard names for things.

In building X, what were the biggest challenges, and what went smoothly?

VIP => biggest issue is that election offices say they already have it online, why implement VIP? Why do I need to work with you? Convey scale & partnerships. Ability to adjust specification to adopt to unique voting structure. As project evolved, able to more quickly give feedback to states. Some things run in to- "already busy as it is", spec is technical in nature, a bit of a workup to getting specification described and translated into terms they understand.

What tools and technologies do you use for these effort?

VIP -> csv / xml single file, single election, includes all relevant polling and ballot data. VIP hosts server, states submit it there, gets validated and packaged and sent to google and they include voting API.
xsd accompanies spec, also web based tools summarizes data and fixes broken data links. custom for the project.

note: setup time for demo.
OCIds - CSVs in a github repo. 3 different parts: details of spec itself, folks who work on it to validate, and data set itself.

Why did you choose this architecture or process. Were others tried, etc (after the "data aggregation/distribution" question)

it's evolved over time, dashboard wasn't around at the beginning of the project. Other ideas: allowing states to create flat file / csv which we would process into xml. Most work in getting it out of systems and into xml. Unique for each state- sometimes disparate systems, other cases it's in a central system.

What are the political and organizational dynamics of collecting this data?

some pushback: concerned about having data in 3rd party tools that they can't correct. Should be mechanism to approve / correct data, but not always clear to counties.

Who were the relevant stakeholders for this project, how were they identified and convened?

sunlight diaspora, put together a convening every year of ballot community, also working groups that meet quarterly. For VIPs- people providing data (states), people integrating / using API (campaigns, large tech companies).

Is there anyone else I should speak with to better understand X?

some folks, hopefully will get an introduction

Research: Google Transit Feed Specification

Note transit feed spec is not machine readable. Just thorough description of files & data required.

As a user of the API I want the full set of validation checks written, and if possible more legible column labels.

User story

As a user of the API I want the full set of validation checks written, and if possible more legible column labels.
Complete SAS to SQL translation (Whitney)
Change column labels, if needed (depends on ability to accommodate non-ordered JSON values?) (Joe)

Acceptance criteria

SAS to SQL translation of validation rules is complete
Change to column labels complete

Data Federation Spectrum

Schema Representation:

Data Dictionary
Official, machine readable Schema
Open Standard

Openness of Data:

open final data
open aggregated raw data
open distributed raw data

Genesis:

Statutory requirement
Policy
3rd Party
Practitioner

Ownership / Maintenance:

Gov - Funded
Gov - Unfunded, centralized
Gov - Unfunded, decentralized
Corporate For-Profit
Corporate Non-Profit
Standards Body

Create interview guide for Kansas

User story

As a team member, I want to understand the current form 742 process for the State of Kansas so that I can understand how the data ingest tool will improve their experience.

Acceptance criteria

Schedule follow up interviews with Kansas
Prepare data ingest tool demo
Prepare draft listing of questions

As a member of the USDA team, I want enhanced validation rules to I can check results against a csv file.

User story

Set up the validation rules for the USDA.

Acceptance criteria

Interview: Mark Headd

Introduction to the project
Notes will be public
Tyler Kleycamp https://github.com/OpenDataCT/state-federal-datasets
State of New York has formal program in place for city & state to submit data to open data repository. State's data cleansing process too heavy a lift.
County level / health inspections etc. do a great job submitting their data. Incentives align - don't have to publish their own data. Data format mandated by state. Andrew Nicklin, worked at center for gov excellence at Johns Hopkins.
Philadelphia CDO

Interview Template

Introductory comments

What is the data federation effort all about? What am I looking to get out of it?
This is a collaborative research project with GSA's Office of Products & Platforms and 18F. The goal is to build a toolkit / playbook for undertaking intra-governmental data collection / aggregation projects, such as data.gov, code.gov, and NIEM. Or goal is to find out what works, what doesn't, and what tools are appropriate for what circumstances, in order to accelerate similar efforts in the future.
Notes & Report will be public
Any questions before we get started?

What is X, in your own words?

What was impetus or driving force for this effort: policy, user needs, etc (perhaps after the first question)

In building X, what were the biggest challenges, and what went smoothly?

What tools and technologies do you use for this effort?

Why did you choose this architecture or process. Were others tried, etc (after the "data aggregation/distribution" question)

What are the political and organizational dynamics of collecting this data?

Who were the relevant stakeholders for this project, how were they identified and convened?

Is there anyone else I should speak with to better understand X?

As a developer I want a set of tests so I can check my work as I go.

User story

As a developer I want a set of tests so I can check my work as I go.

Acceptance criteria

Establish test suite for django-data-ingest (Joe)
Doesn’t appear to be one now, if implementation / features are necessary, stuff like this would be really helpful.

code.gov interview

[ First a reminder: Notes & Report will be public ]
[ Second, what is this effort all about? What am I looking to get out of it? ]
[ Third, any questions before we get started? ]

What is X, in your own words?

Code.gov is the central place to see America's code. It is also the official public implementation of the Federal Source Code Policy.

What was impetus or driving force for this effort: policy, user needs, etc (perhaps after the first question)

"The impetus to create code.gov derives from policy." now the project has evolved, and there are a whole bunch of other reasons why, but policy has always been the main focus, and how to stay true to that policy.

in building code.gov, what were the biggest challenges, and what went smoothly?

At the start, the biggest hurdle was actually getting agencies to come on board. Actually getting them to fill the code.json, helping them out.. explaining the policy, getting them to understand the requirements. Getting agencies to understand we're here to help. Project was always meant to follow a lean / agile management style. Challenge: How to setup code.gov as quickly as possible while still serving users & policy goals? Scoured github for open source code for agencies. Then: how do we evolve what was already done, in the quickest possible way within the same constraints, within the government framework (ATO, technology, etc)? What compromises should we make to get this out the door as quick as possible, but still have it be what our clients need and what policy dictates? e.g., need to expand beyond static site, so we need infrastructure. Communicating through github, building community is a current priority, getting buy-in / ownership from agencies as well as public. What went well? The initial site (from 1 year ago), got it pretty well for a first try. It was a site they put out in a hurry, very much an MVP, no idea what the community response would be. "[just] ship it!" When they went out to oscon / put it on hackernews, had 100k views. NB: it was on federalist. Good infrastructure decision to start with static site. Shipped before full compliance. Gave them a sense of who the audience was going to be. got feedback from policy / community. Another challenge: from an agency perspective, communicating benefits / incentives for agencies is a challenge. (e.g., why should we do this, our stuff is already on github? Or, why open source our code?) Challenge is to get agency code in one place. Challenge at the program level, not always the product / website level. "not so much the technology, it's the people / process."

Could you describe the data aggregation and distribution processes of code.gov?

"code.gov is a huge ETL project" policy actually defines that agencies must have a code.json on their site (most have in their root directory). Code.gov fetches all the code.json of all the participating agencies, transforms it into ways it can be served on the site." "Search is an important thing for code.gov.. that's why we load everything into elasticsearch." Right now, loading everything into static site. Code.gov is only distribution model.

Why did you choose this architecture or process. Were others tried, etc (after the "data aggregation/distribution" question)

[skipping]

What are the political and organizational dynamics of collecting this data?

"From what I've seen, there's been no need to strong-arm anybody into this policy" "No need to make a huge law.. policy speaks for itself." "Major issue, as with everything that relates to people, is effort and time" "[agencies] may not actively maintain it." "Challenge is to keep this in mind as something that is relevant to something that they should keep doing" "Dynamic hasn't been very difficult talking to agencies." "What we're seeing across government is that it's split into thirds, 1/3 that has no policies relevant, 1/3 that have portions, 1/3 that already have full policy". Some folks nervous about making anything public. Would be helpful to get into a law long-term. The more we can get it solidified into law [the better]. Really helpful to have a few shining examples you can show other agencies as an example. "Look at [nasa , GSA, etc], clone these repos, take these steps etc."

Who were the relevant stakeholders for this project, how were they identified and convened?

"Stakeholders range from government agencies to the public. Those interested in government code."

Is there anyone else I should speak with to better understand X?

Tyler Kleykamp Interview

Introductory comments

What is the data federation effort all about? What am I looking to get out of it?
This is a collaborative research project with GSA's Office of Products & Platforms and 18F. The goal is to build a toolkit / playbook for undertaking intra-governmental data collection / aggregation projects, such as data.gov, code.gov, and NIEM, where data is collected from entities over which you do not have direct authority. We call these federated data efforts. Or goal is to find out what works, what doesn't, and what tools are appropriate for what circumstances, in order to accelerate similar efforts in the future.
Notes & Report will be public
Any questions before we get started?

What experience do you have that might be relevant to this effort? (e.g., work with open data standards, participation in gov data collection across organizational boundaries, etc.)

CDO for state of Connecticut
1/2 time running open data program, 1/2 time ways to better use data. Works in office of policy and management (equivalent of OMB). Doesn't work in IT agency or report to CIO. Other state CDOs more focused on data warehouses etc., Tyler focuses more on Analytics. They collect a lot of data from municipalities. They pull data from 169 individual municipalities. Property tax data, municipal spending, etc.. 1-2 years ago became interested in state-federal relationship. data collected from municipalities.

"probably a better way to do this, don't know what that better way is"
"pretty interested in this flow of data from the state to the federal government."

previous: recovery act, federal agencies pump more to state programs.

Data collected from municipalities:

challenges / what went well: real estate report, aggregated from municipality. getting data is easy. 10 categories: residential, commercial, industrial, vacant, etc. One town may consider apartment building as residential, another considers them apartments. Still have paper based reporting systems - small towns don't have full time person working there.

uniform crime reports example- some departments still using triplicate carbon paper.

What was impetus or driving force for this effort: policy, user needs, etc (perhaps after the first question)

90% of the time, state or federal law. Mandate reporting of data to states from municipalities. Some cities struggling, some effort to monitor fiscal health of cities. Requested data from towns, got some resistance. Implemented law that would allow us to get timely information. Big problem: reporting only at end of fiscal year, then data is not timely enough to be useful.

In building X, what were the biggest challenges, and what went smoothly?

smoothly: financial incentives or support work. example: provided grants to get municipalities to get off quickbooks, got great compliance rates. Another example: grants to develop parcel data, if you take the money you must use our data and give us the data back.

for reporting data up to feds:
"as far as I can tell it's a fairly smooth process" . Usually federal software provided & funded. For example - state drinking water. Use fed provided software as database. But it's "trapped" in federal systems. Hypothesis: if we had a better way to make the data more readily available as open data, or access-controlled tables on web.

What tools and technologies do you use for this effort?

"it's all over the place" - you can email, there's a "portal" to upload. More sophisticated systems where you upload CSV / excel. Standardization - either spelled out in statute or (in more successful models) requires agency to come up with standard, usually focused on ontology, usually done through considerable stakeholder engagement. Iterative process. Draft / feedback / etc.

QA / QC after aggregation. Compare to year before, is something way different? Lots of knowledgeable people, things jump out at them. Some work before it's suitable for analysis. Vast majority is for mandated annual report. More recently goes into dashboard etc online.

Why did you choose this architecture or process. Were others tried, etc (after the "data aggregation/distribution" question)

What are the political and organizational dynamics of collecting this data?

broadly, a lot of the people that are ultimately responsible for submitted / collecting data are not technologists, not thinking about other uses for the data might be out there, getting people to think outside the bounds of the exact report they're creating is challenging. Often there's a mandate, but no carrots or sticks. A lot of reluctant participation. Often done begrudgingly.

Who were the relevant stakeholders for this project, how were they identified and convened?

Is there anyone else I should speak with to better understand X?

possibly other state CDOs
Michigan has a recent law to collect data from municipalities.

What efforts are you aware of that fit this category?

Do you have contacts in those efforts who we could reach out to?

What do you think are the primary challenges of these types of efforts?

Would an open source toolkit help?

"No shortage of tech based tools where you can load a spreadsheet in this column aligns with this column in this database" (e.g., basic ETL). Perhaps process for uploading & running basic validations. If some of the data were just more available, that might help.

Interview: Wo Chang / NIST

Introductory comments

What is the data federation effort all about? What am I looking to get out of it?
This is a collaborative research project with GSA's Office of Products & Platforms and 18F. The goal is to build a toolkit / playbook for undertaking intra-governmental data collection / aggregation projects, such as data.gov, code.gov, and NIEM, where data is collected from entities over which you do not have direct authority. We call these federated data efforts. Or goal is to find out what works, what doesn't, and what tools are appropriate for what circumstances, in order to accelerate similar efforts in the future.
Notes & Report will be public
Any questions before we get started?

What experience do you have that might be relevant to this effort? (e.g., work with open data standards, participation in gov data collection across organizational boundaries, etc.)

A few years back, was struggling with taking different kinds of data sets and how to combine them to make better decisions. how to provide an architecture for data sharing? Preferably machine-accelerated sharing, to avoid expensive emailing & human interaction. SC32 working group 2, working on metadata model. Resolved to work on a federated data type registry. Currently problem: how to define the structure? Working with data type registry working group. PID tech (e.g., DOI, etc). Note, not much documentation at this point, though there are some presentations.

What do you think are the primary challenges for the federated data type registry?

2 aspects - how do you map concept (e.g., geo information) to structure? How do you combine data model? How do I know those are same concepts but different data model?

As a user of the API I want the ability to use non-ordered JSON values and / or send one record at a time without headers.

User story

As a user of the API I want the ability to use non-ordered JSON values and / or send one record at a time without headers.

Acceptance criteria

Records can be sent in different orders
Records can be send without headers

As a team member, I want to start the Round III funding presentation.

User story

As a team member, I want to start the Round III funding presentation.

Acceptance criteria

Confirm the 3-4 SMEs (Chris). Likely 1 from USDA, 1 from OMB, would still need 1-2 others?
Add folks to our demo on June 27th
Get any necessary screenshots / content for slide deck (Chris)
Distribute to team, and prep for dry run

As a partner, I want to send uploaded data to a RESTful web service

User story

As a partner, I want to send uploaded data to a RESTful web service so that I can interface to API-based data systems.

Acceptance criteria

System supports inserting records by making webservice API calls
Documentation describes how to set up web service as destination of insert

As a user, I want updated content on federation.data.gov so that it more accurately reflects the research and resources to date.

User story

As a user, I want updated content on federation.data.gov so that it more accurately reflects the research and resources to date.

Acceptance criteria

Process for site updates identified, and deadlines established for first pass
New content is reviewed and pushed to site
Site includes "one-pager content" that an agency can review for input / feedback
Initial content audit is complete and shared with the team
Make voice and tone consistent, and appropriate for targeted personas

Interview: the Open Data Institute

Standards project at ODI

started 3 year project
looking at publishing, standards, how to improve gov service, first set of projects in flight and completing in March 2018
Way that they're approaching work around standards: how do people approach working with Standards? Issues: people not aware of how to create standards, don't know what standards are available, create standards that don't get adopted. They're doing user research, interviewing a variety of people involved in standards development or adoption, or tech folks engaging in standards processes. Desk research: looking at processes by which successful standards have been developed. Capture what common elements are around standards process. What is typical? Who should be involved at each stage? Working with 4 orgs to produce some research on their own: W3C (surveying of existing communities), OpenNorth (perspective on developing civic tech standards), OpenDataServices (OpenReferral), Porism (software firm works closely with local gov association). Each will be publishing at end of month. Aim: synthesize these into guidance / guide book would help people understand what robust standards process looks like. What does good process look like for engaging communities? Recommended tooling? Understanding - when do you need a standard? When to do specs vs developer docs?
GDS Standards Group.
Think about "layers" of standards (analogy: 7 layer networking)
conformity assessment vs machine readable standards
in workshops with different groups, to focus on standard elements. Surprise: issue of sustainability. Successful standards will evolve. "Everyone seems to be wrestling with"
"It's infrastructure work, and nobody wants to fund infrastructure". Phil: folks don't understand how critical standards are to society working. "Data infrastructure is infrastructure for modern society." Another challenge: measuring adoption & success. Did research, couldn't find anything to track the success of standards. How do you surface use of open data?
People say they want standards, but doing standards too early can be detrimental. When are loose conventions enough, vs official standards?
British Standards Institute & Geospatial want to connect more with bottom-up. Phil: knee-jerk reaction, people have had bad experiences with standards bodies. Involve too many people too early, too bureaucratic. "It's a gearing problem" - startups want to move fast, solve the problem now.

As a partner, I want documentation on generating alternate flat-file destination formats

User story

As a partner, I want documentation on generating alternate flat-file destination formats so that I can adapt the system to my specific needs.

Flat-file dumpers are registered in Ingestor.inserters; currently only json has been written. A partner wanting to generate flatfiles in an unsupported format would have to write an Ingestor subclass with its own insert_ method, include the method in the subclass's inserters, and set DATA_INGEST['DESTINATION_FORMAT'] and DATA_INGEST['INGESTOR'].

Acceptance criteria

Documentation exists
Example project using it exists

As a stakeholder, I want to understand who the user is so that I can understand how they'll use the tool.

Goal

Let’s determine who the (meta) user is, and what this will do for them. We have a lot of various users for the work and the report, which is important, but also for the libraries and code. What are the personas for this work, etc.? How do we market / share this information out for people to share with their managers? (Chris)

Acceptance criteria

Read Open Data Foundation’s report - they have clear user personas set up well
Do a quick murally synthesis on the old user interviews
Summarize the current state of knowledge around the current users and who they might be
Set up the initial (maybe top three) personas for this project (who are they? What are they looking for?)

As a user, I want to have simple way to post and store data that meets minimum criteria.

Goal

Goal: get to the point where we have some small, simple ingest mechanism - something that is in a library, built in a modular way, something that could be used to ingest data (Catherine to lead)

Acceptance criteria

Determine generic use case, e.g consuming simple csv files
Determine libraries

Implement basic upload workflow:

Create interview guide for Janis (FNS)

As a designer, I want to thoroughly understand the current 742 form and data submission process, to that I can better prepare for interviews with states.

Acceptance criteria

Review document provided by FNS (Janis).
Create and refine list of questions for follow up discussion with Janis.

Interview: Rachel Bloom

What experience do you have that might be relevant to this effort? (e.g., work with open data standards, participation in gov data collection across organizational boundaries, etc.)

[previous research arose from] geothink, work done as undergrad @ McGill at Geothink . Had a Geothink partner city interested in knowing what standards exist to publish open data. Started out making catalog of standards that existed for such data (had to be open & for specified domains). City wasn't able to disentangle important from non-important info. Then the idea came up to create a dashboard. Made an attempt and what emerged was a systematic set of metrics. open data standards directory is name of dashboard and published by geothink and center for government excellence. Evaluating different kind of open data standards for specific "high value" types of open data.

Also did own work as a student, interviewing preferences for open data standards and types of knowledge about data standards. One finding is that people have different definitions for talking about standards. Interviewed 13 cities in U.S. Mostly open data leads & some other titles. Now works at OpenNorth (montreal) non profit that specializes in civic tech & open data. Now mapping openness onto smart city development. Asks the question: what does it mean to be an open smart city?

What works?

"It's really hard" to answer because different objectives among different actors. Metrics for interoperability of data for municipal data encode a lot of what is good / promoted in these effort. Check it out! metrics themselves bring to light key things that open data standards should incorporate into their development.

What efforts are you aware of that fit this category?

Started at open data standards directory. I evaluated 27 standards, but now over 60 now in the directory.

Do you have contacts in those efforts who we could reach out to?

skipped

Are data standards a good idea?

I like this question because part of this type of research is innovation bias (i.e., people expect that others want to adopt new stuff). But is it actually a priority? Is there a perceived need / benefit by potential adopters? Question about whether standards are good is better suited for adopters because these standards will serve their larger objective.

"I clearly think think they're worthwhile." Any approach that makes data more open / transparent / improves access and ease of sharing information with general public that doesn't compromise privacy/security is good to me. More tools required? "I think that's very advanced for the work I've been seeing"

What do you think are the primary challenges of these types of efforts?

interviewed open data challenges for adopters at cities - time & resource constraints were the primary cited. Interoperability "is even more than a technical obstacle but is a coordination issue as well." Interoperability = seamless transfer of data and information. Do specifically look at data that tries to open the data as much as possible. They just need time to think critically about this kind of stuff. Even if you do provide resources, there needs to be a learning process. People need time to reflect and be informed and to troubleshoot the problems in a critical way. More time to educate themselves.

Interviewees typically decision makers in publishing open data & interested in data standards. Usually managers / project coordinators, and less interviewees were involved in technical details of adopting standards.

data.gov interview

[ First a reminder: Notes & Report will be public ]

What is data.gov, in your own words?

It's the federal government open data site, in the name it's a meta data catalog to make it easy to discover and use data made open by agencies of the federal government (note: includes some non-open data and state & local). Also the manifestation of an internal gov-wide open data policy.

What was impetus or driving force for this effort: policy, user needs, etc (perhaps after the first question)

Geospatial efforts (geoplatform 1) had been an effort to make gov wide catalog of geospatial efforts, thinks it goes back to OMB 16, from effort to coordinate geospatial data from wartime. Existed before 2009. Before 2009 there were agencies that had data on websites. Presidential memo on openness and transparency in government (broader than just data), also OMB policy on open government that required the standup of a data site, stood up in a couple months in May 2009. Original policy required 3 data sets, received datasets and put in first version. From 2009 - 2013 there was steady growth in data.gov, many initiatives blended into data.gov. Continual effort with monthly working group. Many open data efforts included clauses to include data in data.gov. 2013 memo to include open data w/ accompanying OMB policy, states that agencies maintain open data catalog in standard file format on their server. Before 2013 they were operating as a centralized platform. Allows agencies to take advantage of inventories they already have, just automatically reformat into proper metadata schema. Allows agencies to integrate their inventories, acknowledging that there were a lot of publishing efforts going on, just not a great way to catalog. Diversity in capabilities prior to effort.

in building current data.gov, what were the biggest challenges, and what went smoothly?

"took a while for agencies to get on board as to whether it was a good thing" "Some agencies by nature are going to be more hesitant about participating in something that means releasing more information" "Some risk aversion" concern that they will be exposed to people finding fault with data. Concern about judgement from OMB. Some agencies it was a challenge to find tools & resources to comply with this. Made inventory.data.gov in order to manage metadata, add / remove datasets etc. Very helpful for agencies that didn't have an inventory service already. Transition to new policy and new architecture took some time, there were some rough spots. e.g., making sure schema was very well documented, mechanisms to validate. Initial schema was designed to align with international schema standard, second iteration of refinement implemented final standard. Had to take an iterative design ethos- "just do it" then iron out the kinks. How to set expectations? some highlights - no change management timeline around schema, maybe could have been more communication around changes. should have been an official beta.

Could you describe the data aggregation and distribution processes of data.gov?

For larger agencies, they work on this as part of the OMB policy they are required to follow. They are part of the open data community, people are responsible at agencies have regular meetings with OMB & one another. Have community discussing the nuts & bolts. For smaller agencies not required, they have been opting in and following schema. To a certain extend non-federal also provide minimal essential data (~50 participants). Policy asks agencies to inventory all data sets, but is pretty broad, includes non-public and restricted, inventory must follow metadata schema documented on project open data, put it all on data.json file which is hosted on website. Each endpoint is registered on data.gov, harvester checks every 24 hours, if changed then it processes and adds / removes / updates entries. btw, there is a parallel process for geospatial standards. Originally more distinction between public and private datasets.

Why did you choose this architecture or process. Were others tried, etc (after the "data aggregation/distribution" question)

some things came up - metadata vs hosting data. not really set up to host lots of data for agencies on their behalf. policy & process requires enterprise wide inventory, sometimes smaller entities (e.g. departments) want to publish their data. big departments with independent fiefdoms. easier for some than others. Sometimes easier for agencies that don't have a big history / legacy systems. Modernizing harder.

What are the political and organizational dynamics of collecting this data?

[ skipped, already covered, low on time ]

Who were the relevant stakeholders for this project, how were they identified and convened?

try to have minimum number of people from each department.

Is there anyone else I should speak with to better understand data.gov?

might be nice to hear from an agency lead. might be nice to talk with federal geospatial group.

As a policy maker, I want to understand what the current policies / laws are and how this project applies to them.

User story

As a policy maker, I want to understand what the current policies / laws are and how this project applies to them.

Plan for policy workstream: Who to talk with? What documents to review?
An audit of the current policies / laws (and who will do this?)

Acceptance criteria

Audit of current policies / laws completed

Interview Template: Person of Interest

Introductory comments

What is the data federation effort all about? What am I looking to get out of it?
This is a collaborative research project with GSA's Office of Products & Platforms and 18F. The goal is to build a toolkit / playbook for undertaking intra-governmental data collection / aggregation projects, such as data.gov, code.gov, and NIEM, where data is collected from entities over which you do not have direct authority. We call these federated data efforts. Or goal is to find out what works, what doesn't, and what tools are appropriate for what circumstances, in order to accelerate similar efforts in the future.
Notes & Report will be public
Any questions before we get started?

What experience do you have that might be relevant to this effort? (e.g., work with open data standards, participation in gov data collection across organizational boundaries, etc.)

What efforts are you aware of that fit this category?

Do you have contacts in those efforts who we could reach out to?

What do you think are the primary challenges of these types of efforts?

As a team member, I want a couple good use cases we can use to better understand how the ingest mechanism might work and who would benefit from it.

Goal

Let’s identify a few top use case candidates that we can work with. Two stages - first, let’s figure out the thing we’re going to work on while we line up the ideal use case. The program code thing is basically ready to go, and we do have historical data from that, so we’d have no dependencies there. We could go with that initially as Kathryn’s loading it in the ingest mechanism. Or, we could just build the mechanism.

Acceptance criteria

Keep on top of meetings / email chains (currently ongoing threads with OMB / DHS / USDA)
We've decided to keep it free- running until we get something juicier to work with.
Select the first use case!
Set up a working session to pick first use case! (Chris to schedule)

Once use case is selected

Make sure we communicate this to the partner
Get relevant documents / data / process established with the partner
Identify the first “prototype” thing we want to build / test.
Keep on top of meetings / email chains (currently ongoing threads)

Finally, track burn rates to make sure we’ll do what we need to do

Weekly burn rate updates from 10x

Interview: data.gov.ie

Introductory comments

What is the data federation effort all about? What am I looking to get out of it?
This is a collaborative research project with GSA's Office of Products & Platforms and 18F. The goal is to build a toolkit / playbook for undertaking intra-governmental data collection / aggregation projects, such as data.gov, code.gov, and NIEM. Or goal is to find out what works, what doesn't, and what tools are appropriate for what circumstances, in order to accelerate similar efforts in the future.
Notes & Report will be public
Any questions before we get started?

What is data.gov.ie in your own words?

it's the national open data portal for irish government, provides access point
5.5k datasets from across publishers (doesn't host or store)
a lot of different harvesters
federated approach

What was impetus or driving force for this effort: policy, user needs, etc (perhaps after the first question)

around 2009 / 10 / 11, open data coined. Deirdre working at research institute, but a lot of community / citizen / local government sees it as very important. In 2013/14 at a national level the gov as part of open gov partnership committed to national data portal. Since 2014, national open data portal policy. They worked to document what is open data, what is important. From 2014 it was making more data available. Then different orgs came onboard. What was defined early on was public bodies working group. A lot of geospatial / stats community. Came together to define standards / documents. Available in references section of data.gov.ie. Technical framework document defined. Over last 2-3 years, more awareness of open data, emerges as key element of lots of other open policies. Open data governance board established, made up of reps outside of gov, provides guidance & advisory role for open data as well. Recently ranked top in europe for open data.

In building open data portal, what were the biggest challenges, and what went smoothly?

there has been an evolution of the overall initiative. Could we have fast-forwarded that evolution? (unlikely) need to grow understanding as a country, as a government. When portal was first launched, used ckan, couple of datasets lying around. Next big step was drupal / ckan. Past couple years there's been a lot of customization. Focus on 13/14/15 was raising capacity within the public sector themselves. 3-4 ppl are the "open data unit". From tech level- revamping portal again, continuing to add different features. Definitely emphasis on quality, real time data, APIs now. from actual site. Last few months: few high profile real time data available as APIs. e.g., bills / policies api.

What tools and technologies do you use for this effort?

d-cat. w3c recommended. It has become the defacto metadata standard for open data catalogs. d-cat application protocol. +dcat-ap. ckan -> drupal+ckan -> updating to current core ckan, more efficient backend.

Why did you choose this architecture or process. Were others tried, etc (after the "data aggregation/distribution" question)

it's a hybrid - manually update via user interface, api upload, 3rd is via harvesters. Now harvest from 14 different sources, based on arcGIS system. Pull in from data sources that are already there. Recommended if the data volume is large. Also (4), data audit tool, used for pre-publication stage.
audit tool to build data. Also asks questions - should they publish it or not?

visualizations- for catalog itself, there are previews for all different data types. Lot of data in json-stash stmx data, format (statistical data). More to give user impression, there's lot of data here, explore!

What are the political and organizational dynamics of collecting this data?

"very much the role of the open data unit" -> would work closely with open data bodies, explaining what is involved, disseminating awareness. From their side, make it as easy to publish as possible. Make it easy for the public body to publish. Supporting & guiding. Another thing is to show examples and use cases, show how other public bodies do it. Public bodies working group. Focus around the impact of open data. "The more stories, examples, use cases, the better" started open data impact series. Discuss open data impact in panels. Large banks, insurance companies, etc., see how it can have an impact in decision making and processes or in opening markets. "As a community internationally, we need to be doing wider studies of impact of open data."

Who were the relevant stakeholders for this project, how were they identified and convened?

Is there anyone else I should speak with to better understand X?

Who else is working on tooling?

portal at datadoor
insight center
vienna, open data portal watch
European data portal
group in Belfast

Open311 Interview

Introductory comments

What is the data federation effort all about? What am I looking to get out of it?
This is a collaborative research project with GSA's Office of Products & Platforms and 18F. The goal is to build a toolkit / playbook for undertaking intra-governmental data collection / aggregation projects, such as data.gov, code.gov, and NIEM. Or goal is to find out what works, what doesn't, and what tools are appropriate for what circumstances, in order to accelerate similar efforts in the future.
Notes & Report will be public
Any questions before we get started?

What is Open311, in your own words?

[pull up site?] Open311 is a set of standardized APIs for allowing the public to use different tools / technologies to interact with their governments through a consolidated contact center. The same way 911 or 311 shortcodes have become standards, these APIs have been used in a similar way. A lot of 911 centers were getting numbers for non-emergency things- potholes, garbage, questions, etc. Baltimore & NY started. Standardize numbers & processes. Unifying support line for government, also standardized collection for needs of citizens. 2008 / 09 number of apps emerge to give 311 functionality in apps. "Fix My Street" app. Having the website that let people do it, transparency of it, maps, in app was a great model, got picked up in US as well. Some were coordinating with gov, some were going rogue. Rogue apps deceptive - users think they are submitting to gov, but aren't, or aren't consistently. Also app proliferation causes confusion - which to use? In 2009 local gov of DC was doing open data & app challenges ("Apps for Democracy"), for another round they focused on 311 service. Built read / write API, built prototype API. Similar conversations in NYC. OpenPlans suggests helping DC think of their work as a pilot. Apps built around that - 3rd parts apps create better way to make issues for cities, those apps got attention of more cities. SF. Cities had different model of how they operate. Also Miami. SF brought a lot more attention to the effort, captured more people's imagination. Folks started iterating on another version (in 2010). What the API does- let's people report non-emergency issues to city government. You could open up an app, perhaps generic or perhaps pre-filled for a specific purpose. Let's you fill it in, take picture, submit it, get tracking number. you can query status of your issue, other issues. Boston joins as well in 2010, as well as startups & major companies. Spring 2011 spec was published, got pretty wide adoption both in & outside of US (also Canada / European cities), a few dozen cities worldwide.

What was impetus or driving force for this effort: policy, user needs, etc (perhaps after the first question)

[covered in previous]

In building open311, what were the biggest challenges, and what went smoothly?

"One thing that went smoothly that I would normally expect to be a challenge - somewhat coincidental interest in ... having an API for these systems." Open311 model already in place, just translating telephone conversation to apps. Things well aligned - already a lot of process behind the scenes. Willing guinea pig at the same time as there's this interest bubbling up. Other cities add to momentum. "Technical and process challenges going from first city to second city" "Second iteration didn't really work for both cities" [Different cities have different operating models] [Learned that there wasn't more variety than those two cities] Open plans was in a position to act as a neutral 3rd party / standards body. For past couple years, hasn't been a strong stewardship role to push the work forward. "Still exists a significant need for someone to be spending more time coordinating and stewarding... addressing issues with existing specification". 2 categories, reporting problems & searching knowledge base, kb spec needs more effort. Interest & need identified. fed gov contact center / usa.gov and 211 (social services) efforts also emerging. "OpenReferral effort"

What tools and technologies do you use for this effort?

The original spec (v2) just documented on a wiki / html page, layed out different rest request / response parameters. Wasn't a machine readable version of a schema / api spec for several years, but recently there is an effort to put together the openAPI spec / JSON schema. Other efforts on creating validator / test suites that can be run locally, non completed fully. "Probably a half dozen or so language-specific client libraries for the spec". YAML file on website to update known endpoints for the spec. Also a number of open source server / client implementations, but not have been fully vetted as being 100% implementation of the spec. Minor inconsistencies with implementation. OpenAPI spec effort has brought to light more attention to detail. Multilingual support not fully considered early on.

Why did you choose this architecture or process. Were others tried, etc (after the "data aggregation/distribution" question)

[already covered]

What are the political and organizational dynamics of collecting this data?

"Numerous conversation about adopting this standard / model in the fed government... the challenge [is to consolidate the contact centers / CRMs across gov]."

Who were the relevant stakeholders for this project, how were they identified and convened?

Andrew Niclin (sp?), Open Referral effort.

Is there anyone else I should speak with to better understand X?

Other comment -
Chicago biggest city to implement open311, with support from CfA. CfA surge, open plans surge, but interest waning, momentum waning. Figuring out- what is the organization structure of a body that sustains this type of effort? A vendor? Standards body? Local Gov? Federal Gov? Also ongoing opportunity to support disasters? (WH effort)

As a team member, I want to understand the meta project management needs to I can balance my work accordingly.

User story

Figure out our meta project management approaches, kanban / task assignments, tracking, weekly reporting, etc.

Acceptance criteria

Set up tracking board
Articulate more detail around the final deliverables for this project
Timeline
Build out the cards / user stories
Set up templates for weekly reports and retros
Make sure all aligns to how the team wants to work

NIEM: Interview

Introductory comments

What is the data federation effort all about? What am I looking to get out of it?
This is a collaborative research project with GSA's Office of Products & Platforms and 18F. The goal is to build a toolkit / playbook for undertaking intra-governmental data collection / aggregation projects, such as data.gov, code.gov, and NIEM. Or goal is to find out what works, what doesn't, and what tools are appropriate for what circumstances, in order to accelerate similar efforts in the future.
Notes & Report will be public
Any questions before we get started?

What is NIEM, in your own words?

NIEM is a community driven approach to building consensus and collaboration to develop a shared solution to a shared and common problem. To do that they share a common vocabulary, the NIEM data model. "A big piece of the NIEM is that it is a concretely defined machine readable set of artifacts, built in a collaborative and rigorously versioned process" Governed by NIEM participants, domains extend core concepts from other domains.

What was impetus or driving force for this effort: policy, user needs, etc (perhaps after the first question)

I think the genesis of the program was a lot of what [I] stated in the US data federation description]" decentralized systems proliferated. U.S. Criminal Justice system very decentralized. NIEM & it's precursor work was about how do you reach consensus / common vocabulary across these decentralized, disparate partners? Add in more agencies, NIEM really shines.

In building NIEM, what were the biggest challenges, and what went smoothly?

"A lot of the challenge around NIEM- a lot of misunderstanding about what NIEM is and what it's designed to do... a lot of existing domains already have existing standards, see NIEM as a competitor." NIEM really works around inter-domain operability. "Another challenge: political challenge of "I didn't invent it, so I'm not going to support it". One of the successes: "the coalition of the willing - orgs came together and decided we needed a common way to communicate. "As an organization matures, they come to realize they need something like NIEM" "Pretty inexpensive project, but it's been difficult to get funding for it, because it's something that we're giving away... often don't know who is using it." "People are often more willing to build a new thing, rather than to improve something that already exists to meet their requirements." "It has been a real catalyst for bringing communities together to talk about their data sharing needs" in contrast to mandate / policy etc, this is a "grass roots" effort. Positive: publishing regular releases, "NIEM looks the way it does because it supports the governance model" The actual governance model: 2 committees, NIEM technical architecture committee manages specs & tools that publish the content as reusable schemas and NIEM business architecture committee. Domains exist separately, managed separately manage their own schemas that build on & extend the core schemas in an interoperable way.

What tools and technologies do you use for this effort?

At it's core, NIEM is a set of XML schema documents ("NIEM releases"). Every year we update the domains & we synchronize those domains with each other. Every 3rd year they we do a "major release" - update domains & core. Also update some supporting schemas upon which the technical specifications are based. Github repository for NIEM releases. also a tool that supports users with specific domains. schema subset generation tool (the "grocery list" schema recipe maker), you can add additional definitions. Additionally they publish spreadsheets. Also spinning up a json version, json-LD for web-friendly json. "When somebody's building a message, they need to be able [to customize]" - thus they introduce Naming & Design Rules for how to extend NIEM concepts in a schema. Those roles can be validated / checked by Contesa (conformance test assistant). Also developed and launched an open source tool for searching & navigating what is in name - search by core, domain, keyword. Working on JSON export (interested in feedback!).

Why did you choose this architecture or process. Were others tried, etc (after the "data aggregation/distribution" question)

NIEM actually developed out of the global justice XML data model (DJXDM), brought out of DOJ program "global", needed to setup interoperability with very different types of orgs. Started with data dictionary, but that lacked machine-readable interoperability. After 2001, NIEM extended DJXDM across many domains. Knew we needed interoperability, but also independence. XML was popular at the time - XML schema was becoming commodity software. XML Schema provides really good, reusable data definitions. XML schema really supports inheritance, really good for this application. "If not XML schema, then what?" "Ontology standards very powerful" "UML... tends to involve a focus on a specific tool"

What are the political and organizational dynamics of getting NIEM setup?

"One of the key challenges: who are the authoritative sources of the data? Who are the key people / organizations we need to go to find out what a word means? In some cases that doesn't exist." "Getting people comfortable with the idea that we're going to operate by consensus.. all of your voices matter." But how to find the right people who are the right people from the right orgs. "Who owns this data element? Not a question that's always easily answered." "Have to be ok with a little ambiguity" e.g., lots of ways to do locations.

Who were the relevant stakeholders for this project, how were they identified and convened?

Is there anyone else I should speak with to better understand X?

Data Coalition Interview

Introductory comments

What is the data federation effort all about? What am I looking to get out of it?
This is a collaborative research project with GSA's Office of Products & Platforms and 18F. The goal is to build a toolkit / playbook for undertaking intra-governmental data collection / aggregation projects, such as data.gov, code.gov, and NIEM. Or goal is to find out what works, what doesn't, and what tools are appropriate for what circumstances, in order to accelerate similar efforts in the future.
Notes & Report will be public
Any questions before we get started?

What projects are you aware of in the past that might fit into this category?

Standard business reporting project in Australia- regulatory reporting standardization project. Tax minister convened 17 fed & regional agencies for reporting. Involved software industry / vendors from the get-go. Australian Business Register. Standardizing across parts of government that collect similar data from different agencies. (Dutch did something similar) Much more centralized. GLEI has LOUs, approved vendors that will do "vetting" of an entity. Like DNS but with vetting. McKinsey just published.

What projects might be coming down the pike that fit into this category?

HHS compiled all host / award grantee, recognized 20-30% duplicate data elements. Recommended that gov create consolidated taxonomy for grantee. SEC has been doing work expanding XBRL based information collection reports.

What is nice about being Data Coalition-style?

Great advantage - seen as a convener of conversation. People bounce ideas off of them. Disadvantage- sometimes distant from actual work being done. Challenge to be aware of & articulate / defend all the different projects. "It's complicated work and a lot of work is happening" "Very easy for a narrative to build that is counter-productive to the work" "Keep supporting it."

What are the biggest challenges from your perspective in pulling off large, intra-governmental data collection efforts?

"Getting past the ROI question that's always thrown out there as something that can be measured" "What kind of value do you want to put on transparency or easing management friction" "It all comes down to money, for congress especially" "Not anyone's fault, but they need to find something to pay for something" "Should never be an add-on project, should just be part of doing business."

When is a mandate required vs bottom-up?

When there's a clear need within an organization... that's helpful, when there's an impression that this is going to drastically change how this is going to change my job, there needs to be a top-down approach.

As a user, I want documentation so I can understand how to install and use the tool.

User story

See if we can get the site up and running with the existing documentation, and plugging documentation holes where we encounter.

Acceptance criteria

As a USDA team member, I want to identify the test case for the application.

User story

As a USDA team member, I want to identify the test case for the application.

Acceptance criteria

Identify potential State/vendor to test USDA app (Whitney)
USDA reach out to potential partners to assess interest/capability

Present for Open Data BiWeekly

Hey All,

Thanks for having me today. My name's Tony Garvan, I'm the data lead for 18F and recently I've been working on a cool project with Phil Ashlock called the US Data Federation. I'll talk for a few minutes about what it is and how you can help if you're interested.

The US Data Federation is a joint effort between GSA's office of products and platforms and 18F to build a toolkit / playbook for how to best undertake these types of data projects. We call a federated data project one in which you are collecting data from entities that you do not have legal authority over. So, this could be the federal government collecting from states, or states collecting from municipalities, or OMB collecting data from agencies, etc. Any situation where you are trying to collect a certain type of data across complex organizational boundaries, special problems arise, and we're interested in understanding and mitigating those issues. Past projects that fall under this umbrella that we've spoken with are data.gov, code.gov, DATA Act, NIEM, Open311. We also spoke with Connecticut CDO Tyler Kleykamp about his experiences collecting data from municipalities. We ask questions about what was most challenging about these projects, what successes there have been, what the organizational challenges were, what tools and technologies are used, and more. We've already gotten some really interesting themes emerging about how to start, how best to incentivize compliance, and the close relationship between technology and policy. All our deliverables will be public, so we'll share all of our learnings when it's ready.

The first stage of this effort, that we're currently in, is focused on cataloging these efforts and performing in-depth interviews about the challenges and problems of the effort. So, here's the first ask: if you are aware of a project (could be in federal government or elsewhere) that falls into this category, please email me at anthony.garvan at gsa.gov with the name of the project and contact information for a person (could be you) who has first hand experience with it. We'll follow up to find a time to interview you or the person you recommend.

The second ask is that if you do not have necessarily first hand experience with an effort like this, but perhaps have second hand experience observing many of these efforts and are willing to speak, we'd love to interview you. We're really interested in trading notes with anyone who has experience with and opinions about this type of thing.

So, that's all I had for today, do folks have any questions?

As a developer on the team, I want to know who will take over duties so that I can gradually roll off the team as my assignment ends.

User story

Handoff duties - in theory someone that could pick up and continue working with the tools and such things.

Acceptance criteria

DATA Act interview

Introductory comments

What is the data federation effort all about? What am I looking to get out of it?
This is a collaborative research project with GSA's Office of Products & Platforms and 18F. The goal is to build a toolkit / playbook for undertaking intra-governmental data collection / aggregation projects, such as data.gov, code.gov, and NIEM. Or goal is to find out what works, what doesn't, and what tools are appropriate for what circumstances, in order to accelerate similar efforts in the future.
Notes & Report will be public
Any questions before we get started?

What is the Data Act, in your own words?

The data act is a federal law that broadened the scope of what types of federal financial information is public, and also mandated a federal data standard for that information.

What was impetus or driving force for this effort: policy, user needs, etc (perhaps after the first question)

Couple of things: Kaitlin's work at sunlight looking at usa spending shows that data is incomplete, at same time a couple years later the recovery act board had some success tracking finances and uncovered some fraud. Lobbying efforts from tech organizations in data coalition etc.

In building the Data Act, what were the biggest challenges, and what went smoothly?

Biggest challenges were that the law mandated that two different orgs run the implementation - having two owners leads to confusion, different incentives. Shear scope of getting buy-in from federal government over what the data standard should encompass and how the quality of that data could be enforced. Shared understanding - what is an agency? What is an obligation? Also usual challenges - people reluctant to share data and information, agencies used to getting punished if data is published. Also challenge in government - "conflating complexity and magnitude of a policy related task with the amount of technology you need to apply to it" "Just because something is a big job in the federal government doesn't mean you need a lot of technology to do it" "[created] a lot of noise to work through" "Everyone came in with their ideas of what technologies were required to solve the problem without understanding the problem" positives - "We did meet the deadline and requirements of the law" "Treasury went into it with a high level vision of guiding principles" (1) agency-centric, and (2) data must be accurate before it gets delivered to them. Given those principles, took a user-centered and agile approach and worked with agencies to get there. "Tested with real data early and often" "Tremendous amount of outreach" workshops, external stakeholder calls, heavily invested in user research.

Smoothly- overall had really good leadership team on treasury side. A lot of great staff. Got to a really good place with the contractor. Core people generally good. Challenging: time pressure. Tough to get agencies on board and comfortable with agile. Unfunded mandate by congress, made people very combative. People even personally nasty. Tough on morale, people screaming at you. Building a product for all agencies to use is tricky- every agency has their own crazy variant. Proxies make SSL difficult, some have weird versions of IE. "Getting Tim as our early adopter was really good." Agencies began to understand that by being engaged their feedback would be incorporated in a matter of weeks. "Basically no issues [at actual launch due to extensive beta testing]" "Agencies really helped us in a big way by engaging". "Like pulling teeth to get [some external stakeholder group's time for user interviews, e.g., journalists]"

Could you describe the tools and technologies used for aggregation?

Agencies work from a data standard described in 2 ways: XBRL file, and a series of spreadsheets with human readable descriptions. Uploaded file, python / SQL validates in one workflow. Raises warnings vs critical errors. Originally architected as a series of microservices, was a mistake. Really challenging to manage, workflow unmanageable. Premature optimization (i.e., assumption that data couldn't be in memory). Validations - validating the file, validating certain columns against other columns / files.

Why did you choose this architecture or process. Were others tried, etc (after the "data aggregation/distribution" question)

We started with csv in early prototype phase because it was easier to manage. Thought they would "evolve" to json schema etc., but it turns out everyone loves CSVs, we wanted them to be focused on what the primary challenge was - linking award data to contract data. Data is transactional, not hierarchical. Everyone was like "thank god it's just a spreadsheet". Why custom software? Accountability. Ensures specific controls / conformance. Pretty strict - can't overwrite data to force compliance automatically, etc.

What are the political and organizational dynamics of collecting this data?

"There's a lot" "A lot of time pressure" "Having a statutory deadline pt everyone on edge" "Law requires IG to evaluate how they're doing" "Forced 3 data communities to talk with either other in a way they hadn't done before: CFO / contract / award communities. Had very different business processes... everyone had different opinions [about schema]. Forces communities come together. Made agencies submit as one." At higher level, having 2 agencies was "very fraught" a lot of the time. "Really hard to disentangle policy from implementation", but very different in practice. "[some agencies are] particularly eggregious about not wanting to lose control over data." A lot of pushback on removal of obfuscation of data. "This was unfunded but everyone did it." "You need a champion who's willing to incur some political risk to get projects like this done... and an early adopter." "Nobody wants to be first, but nobody wants to be last" "Having one early adopter and having him go out and talk about it did a lot to get agencies on board."

Who were the relevant stakeholders for this project, how were they identified and convened?

OMB and Treasury, each agency has a senior accountable official for data quality (most often CFO). Grant making community (FACE), PCE (procurement community), ACE (financial one). Regularly went to those meetings to present to them. DATA Act inter-agency council. Executive steering committee (OMB & treasury folks). Monthly call w/ people outside of gov, briefed hill staff, GAO audits / treasury IG audits.

Is there anyone else I should speak with to better understand X?

Interview: James McKinney

Introductory comments

What is the data federation effort all about? What am I looking to get out of it?
This is a collaborative research project with GSA's Office of Products & Platforms and 18F. The goal is to build a toolkit / playbook for undertaking intra-governmental data collection / aggregation projects, such as data.gov, code.gov, and NIEM, where data is collected from entities over which you do not have direct authority. We call these federated data efforts. Or goal is to find out what works, what doesn't, and what tools are appropriate for what circumstances, in order to accelerate similar efforts in the future.
Notes & Report will be public
Any questions before we get started?

What experience do you have that might be relevant to this effort? (e.g., work with open data standards, participation in gov data collection across organizational boundaries, etc.)

outside gov: sunlight / Open Contracting Partnership / founded open north. Spectrum of how much effort parties need to do:
scraping and normalizing: better if data doesn't change too frequently, fairly simple, simple meaning. e.g., collecting officials contact info, simple. complex: scraping contracting information.
Slightly more effort: asking them to send you the data. e.g., send us your shape files. they need to (1) find the data, (2) send it to you, (3) give permission to republish.
ask to publish according to a specific format ("getting into realm of standard"), publish on website, send url. In all cases already had open data policies. In terms of encouraging adoption
more formal: standards like open contracting data standard / elected officials data standard / aid transparency. Is it just technical or also policy? what is being measured, how to assess, another simple project is openaddresses.io, but in cases like aid and contracting, "you could get very deep". How to accommodate for differences across jurisdictions / agencies? At that point you want a permanent product team, follow a much more operational approach. For previous levels.
even more formal: legal requirement to publish + centralized operations, vs more decentralized effort.

Why end up in different parts of spectrum?

ambition / comprehensiveness
level of data quality you're targeting
complexity of domain
interacts with policy or purely technical?
centralized approach better for critical top-down initiatives (e.g., DATA Act)

What efforts are you aware of that fit this category?

Do you have contacts in those efforts who we could reach out to?

What do you think are the primary challenges of these types of efforts?

People who receive data get clear benefit, but data owners do not have incentives to change- they already own and benefit from the data, another layer of reporting is only a hassle. Frame it from users perspective / representing interests of a whole group of stakeholders, people more interested. Showed that they weren't going to collect data and do nothing with it: already have an app for it. Before doing broader outreach, identified prominent first adopters. sequence adopters to get people close to them geographically & in terms of city size. For contracting, easier to make argument for shared benefit. Also tracking performance of service provider is helpful. A lot of cases people may benefit, but you still need to make the case well.

in case of contracting, more countries might want to include budgets with that. might be multiple additional stages , committed vs executed. Could model it, but then what if someone else does it differently?

Aggregating data sets- agree on common core, people can include more if they want to. In case of data catalog, that works, but in contracting case, new requirements pose a bigger challenge because they have high standard for interoperability.

Recent challenge- quality or specificity of data dictionary. Field that includes cost value - is that before or after taxes? also interpretability- certain domain specific words are interpreted in different ways (e.g., "tender" translated several ways), high risk in a domain specific field with a lot of jargon.

social challenges- getting people to correctly implement it. Might see standard but don't adhere to it exactly, use it as a guide vs rule. Often happens with bigger orgs, "we're big, we know better". People "just do it there way", challenges to make sure people are implementing it the way it was agreed on. Having more consensus driven approach would be appropriate.

in terms of adoption --> people might see value, but need to justify cost of adoption, might be more on social side than technical side.

also, people seeing it as a compliance exercise, then might not be invested in good data quality. e.g., some department looking like they are procuring a lot of soybeans. Why? It was the highest on the form for a required field.

with complicated standards, might get buy-in at high level, but with implementers it might not get the same framing.

tooling - useful?

data completeness
validators
"we need better tools like that"
a lot of times groups pursue a specific solution because general solutions dont exist

Interview: Andrew Nicklin

Introductory comments

What is the data federation effort all about? What am I looking to get out of it?
This is a collaborative research project with GSA's Office of Products & Platforms and 18F. The goal is to build a toolkit / playbook for undertaking intra-governmental data collection / aggregation projects, such as data.gov, code.gov, and NIEM, where data is collected from entities over which you do not have direct authority. We call these federated data efforts. Or goal is to find out what works, what doesn't, and what tools are appropriate for what circumstances, in order to accelerate similar efforts in the future.
Notes & Report will be public
Any questions before we get started?

What experience do you have that might be relevant to this effort? (e.g., work with open data standards, participation in gov data collection across organizational boundaries, etc.)

"A lot" - worked in NYC gov for a number of years, participated in open311. Also one of early partners in "lives" standard from yelp. Also BLDS building and land development specification. Also library of standard at datastandards.directory. Been involved in direct development in standards, also involved with city gov in adoption.

What efforts are you aware of that fit this category?

60 in datastandards.directory. OpenReferral - very federated.

Do you have contacts in those efforts who we could reach out to?

greg bloom for open referral. pls remind him.

What do you think are the primary challenges of these types of efforts?

"I think there's a few... one is that there is no central authority for those types of data. Tons of variety in state and local government, both in terms of data structure and organizational structure.

Restaurant inspections- federal gov makes recommendations. (through FDA). But no guidance on how much weight to apply to each thing. Cities/ counties weight things in whatever way they want, along with stuff on their own etc. How each inspection is done varies widely by jurisdiction.

"Organizational ego comes into this as well" "people will do what it takes the data to get it into the format, so we're not going to align to a common set of standards." little bit of ego, some territorialism, some varying business practices by jurisdiction.

What incentivizes people or overcomes challenges?

When there's a clear use case that has a very clear case of adoption. Example: Google transit. Clear use case for broad potential for adoption. Fewer of those clear use cases in other domains. If there is a clear value add business case for both the provider and the consumer, it's going to happen.

Another one is benchmarking. Big appetite for small cities. Federated standards could be successful there.

what about tools & technologies?

mailing lists, discusssion boards, slack, decent websites are good places for convening the people, biweekly/ monthly calls to convene people. On technology side "I don't thing there's anything that great". Concept of data packages - getting data / meta data together in zip files. As soon as you get into taxonomies, it varies quite widely. NIEM.

what are some organizational challenges faced by these efforts?

"Both political and egotistical" "Fear of failure or risk of exposure can also be a barrier" . e.g., people might have delinquent policies / lax standards for inspections.

Data ingestions is a challenge- riskier than publishing. Concerns about security, abuse, accuracy. Hesitant to ingest data from non-government sources. Both gov and non-gov data pose problem / liability for ingestion. Long term data exchange relationships develop legacy issues over time, so ingestion of new data a problem for that as well.

"There are a lot of barriers, both technical and human, to doing this type of work, but standards in a particular are the key to getting data sharing to scale." "I do think there are these needs for inter-organizational managing bodies to coordinate this stuff" "need an independent convener, otherwise it can stall out or lose political momentum"

What's the best home for efforts like this?

e.g., lives spec by yelp dominated by private sector and doesn't work for a lot of cities.
e.g., transit spec- successful ownership by private sector. For having inter-organizational conversations, 3rd party very helpful as mediator.

Interview: Geospatial Efforts

Introductory comments

What is the data federation effort all about? What am I looking to get out of it?
This is a collaborative research project with GSA's Office of Products & Platforms and 18F. The goal is to build a toolkit / playbook for undertaking intra-governmental data collection / aggregation projects, such as data.gov, code.gov, and NIEM. Or goal is to find out what works, what doesn't, and what tools are appropriate for what circumstances, in order to accelerate similar efforts in the future.
Notes & Report will be public
Any questions before we get started?

What is X, in your own words?

Federal geographic data committee office of secretariat. Made up for 32 federal agencies, most have geospatial part of portfolio. Steering committee - executives who have geo spatial as part of portfolio. FGDC governed by A-16, WW2 effort trying to organize geospatial for civilians. Geospatial has been around for a long time "it's what the government does." A-16 revised in 1990, FGDC organized as coordinating body. National Spatial Data Infrastructure (NSDI) - includes non-federal and non-governmental datasets as well. About 200 members, cross-cutting view across different sectors. Metadata standards, geospatial standards. Content standards for digital geospatial metadata (CSDGM). Different agencies had already started. Established clearing house, a way to aggregate information by pulling it into database you could search. 2011 / 2013, geospatial world faced with open data policy. Old problem in geo world! Complementary to geo efforts. Data.gov has done a good job allowing geospatial community to keep going with what . ISO 19115 - new standard. Did crosswalk between those ISO standards and data.gov schema. data.gov allows harvest sources to be tagged as geospatial.

What was impetus or driving force for this effort: policy, user needs, etc (perhaps after the first question)

[covered]

In building X, what were the biggest challenges, and what went smoothly?

challenge for agencies: asking them to "do it twice". Success: data.gov is "the place. But some agencies question why we need to do geospatial thing? (answer: geospatial standards a lot more robust). Can transform using crosswalk, but it's extra work. Project open data requires project / program codes, not represented in geo standards. Also licensing was a problem. On policy side: monthly meetings, get together, bring up issues. e.g., errors in data.gov, teams of interagency folks got together to do crosswalks. People / policy / technical.

What tools and technologies do you use for this effort?

"each agency is a little bit different", but data.gov sets standards. For creation, using vendor tools. But ultimately everything is in XML going through pipeline. ISO standards use XML. Meta data creation: "hundreds" of tools. tools all over the place. "A lot of it is what data.gov offers" - everything has to be in json or ISO formats for metadata. Old standards crosswalked into ISO.

any gaps or tools / technologies that could have helped?

"you think you have it covered, then somebody throws a curveball at you." started in 1998, services / APIs didn't exist, technologies change over time. Things that are now in the data that were part of it. "Of anything, it would be take the time to do it right." better definitions, dictionaries, all that stuff. The more you can spend time up front, the better. Using common vocabularies, "gosh, that would have been nice" - to have some structure. In early days, people didn't want to do it. acronyms don't even make sense over time. "Stop putting people into these things" - think about organization, not individual. "Be more universal in any way that you can."

What are the political and organizational dynamics of collecting this data?

"That's part of the role of FGDC, is to provide that facilitation across agencies." For example, addresses. but when you start putting it in context of 911 emergency response, can be a lot more complex. Need for national address database, every state / locality does it a little differently. Trying to get a broader picture is really hard. DOT taking lead along with U.S. census bureau. They've said they needed it, then trying to get right people at the table. "Sharing information can always be an issue" - some states due to privacy regs in their states, can't share it.

Who were the relevant stakeholders for this project, how were they identified and convened?

"I would start with any known standards bodies, and there's a lot of them out their for different topics area" (in geospatial, it's ISO technical committee 211). in U.S., American Standards National Institute (ANSI). See what those organizations are doing, trying to talk to some folks. See website, ask people if they're working in standards arena. "The more consistent you can be, everyone benefits from that."

Interview: Tim Wisniewski (PHL CDO)

Introductory comments

What is the data federation effort all about? What am I looking to get out of it?
This is a collaborative research project with GSA's Office of Products & Platforms and 18F. The goal is to build a toolkit / playbook for undertaking intra-governmental data collection / aggregation projects, such as data.gov, code.gov, and NIEM. Or goal is to find out what works, what doesn't, and what tools are appropriate for what circumstances, in order to accelerate similar efforts in the future.
Notes & Report will be public
Any questions before we get started?

What kinds of data does the city collect, and from whom?

city of philadelphia: parking authority data, school district data
opendataphilly.org: includes SEPTA, university data, yelp data from restaurants. Transity authority / school district/ courts not run by city.
many departments report: addresses, how they relate to parcels. Not every department agree on addresses or parcel addresses, leads to a lot of difficulty in matching records and leveraging data. Team in the IT dept (cityGEO) making a shared service called Address Information Service. Takes address and gives all city data related to that address. "Cities are inherently geographic entities, this comes up pretty often" Much better than trusting sources of record. Also require standard metadata from datasources from data sources from the public for opendataphilly.org.

Sometimes there's a big contrast in the day-to-day reality [of data standards] with realities on the ground: mainframes ranging to modern web apps. One example: really great if you could type in a property owners name and what addresses / contact info is. Then you look at property assessment system, and it's a mainframe where owner field has 16 characters. Sometimes truncated, sometimes spills into multiple fields. Sometimes just getting data out of a mainframe into RDB can bring you miles forward.
"many departments have been sharing data in an ad-hoc way for quite some time" Open data brings in more transparency and technology to.

opendataphilly standard: was a data inventory task a few years back. GIS group had been facilitating data sharing for several years, used that as a foundation. Then published to metadata.philly.gov. People fill in metadata in an application. data.json "wouldn't work here" - not every department has IT people. CKAN doesn't support data dictionaries. metadata.philly.gov includes datasets that are not shared publicly.

What was impetus or driving force for this efforts to collect data: policy, user needs, etc (perhaps after the first question)

In collecting data, what were the biggest challenges, and what went smoothly?

for keeping metadata.philly.gov up to date: trained them on how to use it, got feedback on what was confusing, added tips & notes. Were using arccatalog before, it's public, works in the browser. Still have to nudge & remind, but there's a process in place to check that the metadata is up to date before published. "Departments come to us pretty frequently asking us to publish data" "A lot of support for it" "sometimes it's publishing to the public, sometimes it's just sharing with other departments" Sometimes department staff fill out, sometimes they fill it out for them. Why publish to public? A lot of departments that do amazing work. Usually publishing data accompanied by visualizations & blog post / press release. Lots of times they get requests from the media / city council, reduces burden in that way.
"The easier you can make it to comply, the more people will comply" Generally people "get it" "It's more that everybody's got a million things going on" "The easier that you can make it... like making the UI streamlined for common workflows" "employing user experience design, where the users are data publishers."

What tools and technologies do you use for this effort?

opendataphilly - stored in cartodb . ETL for city databases to carto: python scripts, scheduled in in-house tool called taskflow (scheduler / background task runner). carto provides download links. CKAN hosts the links. Application builder Knack for metadata entry. SaaS product (like MS Access in the cloud).

Why did you choose this architecture or process. Were others tried, etc (after the "data aggregation/distribution" question)

"A result of trying it other ways and learning what the pain points were" "Constantly trying to consolidate, and simultaneously improve that infrastructure" used socrata before carto, published to github before that. Vizwit made to provide some visualization capability th. "Largely because we have been able to embrace open source software and publish open source software, it has given us a lot of flexiblity."

What are the political and organizational dynamics of collecting this data?

Who were the relevant stakeholders for this project, how were they identified and convened?

Is there anyone else I should speak with to better understand X?

As a team member, I want a policy on code review / PR processing.

User story

As a team member, I want a policy on code review / PR processing.

Acceptance criteria

We have a policy!
We agree to a policy!

Review and respond to the open data standards draft guide.

Acceptance criteria

Respond to open data standards draft guide (Chris / Phil)
Feedback by April 6th, and / or being aware of it to inform/structure this work (and avoid duplication).

As a partner, I want to see an example of the data ingest tool working against my data so I can understand how it might benefit me.

User story

As a partner, I want to see an example of the data ingest tool working against my data so I can understand how it might benefit me.

Acceptance criteria

Define table schema for a use case
Set up and document django project implementing that schema
Demonstrate data ingest tool working against a client’s use case
Deploy the demo to cloud.gov

Follow up with OMB on use case. [stub]

User story

Follow up with OMB on use case.

Acceptance criteria

Grants project
Program codes project

Document existing data initiative maturity models or quality assessment frameworks

As a partner, I want to understand how to establish roles, responsibilities, and requirement docs (e.g. schemas) for client use cases.

User story

As a partner, I want to understand how to establish roles, responsibilities, and requirement docs (e.g. schemas) for client use cases.

Acceptance criteria

User who receives data from federated source and validates/aggregates
User who prepares data as a federated source
Develop/receive technical requirements documentation/schemas

Interview: OpenReferral

Introductory comments

What is the data federation effort all about? What am I looking to get out of it?
This is a collaborative research project with GSA's Office of Products & Platforms and 18F. The goal is to build a toolkit / playbook for undertaking intra-governmental data collection / aggregation projects, such as data.gov, code.gov, and NIEM. Or goal is to find out what works, what doesn't, and what tools are appropriate for what circumstances, in order to accelerate similar efforts in the future.
Notes & Report will be public
Any questions before we get started?

What is X, in your own words?

The issue is directory information about human services. Offering open standards / interoperable APIs / new practices and business models that promote the provision of this data as a public. Information about resources for people in need to be like a public utility. Use case that inspired early on: publish resource data googlable. Google introduced civic services for schema.org. Don't have incentives to publish the right. Dealing with info not collected by governments. Call centers, publishing data on webpage. Being the bridge between closed formats and published data. OpenReferral is the exchange layer, schema.org provides publishing layer. Adopt openreferral standard, then easy to publish schema.org, share with other organizations.

What was impetus or driving force for this effort: policy, user needs, etc (perhaps after the first question)

He saw these needs popping up over and over again in DC. Saw trend: market failure, more and more apps provide less and less trustworthy information. Everyone competing to be "the one." If one of the private companies succeeds, that's also bad! Then they own the public information. "We need public goods and infrastructure before competition." "If you don't have public goods and infrastructure, you are cruising for various bruisings."

In building X, what were the biggest challenges, and what went smoothly?

Biggest challenge: cultural. complex, long term work, multi-stakeholder collaboration. We want quick, short term, linear-to-scale wins. Mismatch between expectations and reality. "Difficult to get people to invest resources in things like infrastructure rather than shiny apps." Other cultural dimension: treated like a market. Instead of investing, "just build the best moustrap." "skepticism of the value of cooperation." "What we're finding is that that can be unlearned." "We can rediscover that value of cooperative logic... it takes a first mover, a second moved, but once people are collaborating, the logic is hard to beat." "It requires that investment of resources in order to align incentives around that value proposition."

Gone well: "I don't think the world needs more apps, but we have several apps that are built around the standard." "Having tools that everyone needs that they can plug & play has worked." Community built (Phil, Open Data UK, etc).

In interests to do this, but they don't have the capacity to do the things that will save them capacity. Simply making the case is not enough, you need to make the case and then get additional resources to come.

What has gone well: found funding that was allocated to doomed projects and able to demonstrate value. (took many years).

What tools and technologies do you use for this effort?

Tech around standard: started with google docs, now on github with readthedocs documentation, using hypothesis for commenting. Standard itself: vocabulary, logical model, formatting instructions. formatting: csv. Objective: be simple for folks to open up and edit. But domain is more complex than GTFS. e.g., one institution with multiple locations. Then json data package + csvs. Now they have an open API spec. Next chapter of evolution: likely API-first with json / csv as one formatting option. They're not harvesting the data. Hypothesis 1: self publishing (peer-to-peer). Hypothesis 2: centralized hub for community (one-to-many). Average area: 1-2 employees, how to recoup the cost of the data if it should be available for free? Hypothesis 3: many-to-many federation. Many intermediaries. Different intermediaries can share data in a federated network.

Why did you choose this architecture or process. Were others tried, etc (after the "data aggregation/distribution" question)

What are the political and organizational dynamics of collecting this data?

improve service delivery, reduce cost of maintaining information, improve ability for decision makers to assess the allocation of resources against community needs.