LCF-Project

Living Costs and Food Survey (LCF) Project repository

Executive Summary

For a short summary of the Living Costs and Food Survey (LCF) Project please see the following document .

Overview

The ONS Big Data team was approached by ONS Social Survey Division (SSD) about the possibility of using commercial data and/or data science methods to help improve the processing of the Living Costs and Food Survey (LCF).

Figure 1. LCF Diary process

In order to facilitate the LCF diary process, two prototypes were developed by the Big Data Team in consultation with the Social Survey Division, Surveys and Life Events Processing and the end user DEFRA.

The proposed solutions harness information from clean historic LCF diary data to help complete missing product quantity information (i.e. amount, volume or weight purchased) at the point of data entry.

strand A: Using historical data to create a lookup

https://github.com/ONSBigData/LCF-project/tree/master/LCF-analysis

Entering LCF data from diaries into the database takes a significant amount of time. Currently it is done in a system called Blaise and the most resource intensive part is the amount (weight) information retrieval as it is often missing in the diary / on the receipt.

Although the customer (DEFRA) only requires amounts to be completed for half of the survey respondents, the additional time taken to find the correct amounts (usually via internet searches outside of Blaise) is a large contributing factor in diary processing delays.

A solution which could integrate easily into the current system and coders’ work flow was piloted using flat look-up functions already available in Blaise. The goal was to give the coder an option to choose an amount from a list of matching or very similar items previously entered within the Blaise environment (eliminating the need for an internet search on different machine or browser).

Figure 2. LCF flat file solution process

The picture above shows a summary of the data processing pipeline for the flat-file prototype. The prepared lists get exported into a CSV file and handed over to the Blaise team, who convert them into a (proprietary) format suitable for loading from within the questionnaire.

Each look-up file still contains a lot of items and therefore the items’ ordering is important. When the look-up file opens in Blaise, the position of the cursor needs to be such that the next few products are the most similar to what the coder is looking for.

This has been achieved by a modified K-Nearest Neighbour classification algorithm.

strand B: Using a SOLR-based indexing solution

https://github.com/ONSBigData/LCF-project/tree/master/LCF-shiny

As it was mentioned above,entering LCF data from diaries into the Blaise takes a significant amount of time, and the most time consuming part is the amount (weight) information retrieval as it is often missing in the diary / on the receipt.

Figure 3. Screenshot of Blaise system

Another solution proposed by the Big Data team was a system that is using a SOLR -based server in order to help with automatic COICOP classification and to provide the most probable weight for items based on the item cost.

SOLR is an open source, Lucene -based search engine library providing scalable enterprise indexing and search technology. Initially records created from historical LCF data are indexed so that they could be retrieved quickly based on requested criteria. By default, SOLR uses a modified TF-IDF method to calculate a similarity score between the query and all available historical LCF data

A Shiny app was created to mimic the BLAISE system in appearance and functionality in order to show how this could work from within BLAISE

Figure 4.Shiny App simulating BLAISE interface using SOLR backend to predict COICOP and propose weights

Setup instructions for installing and configuring SOLR on Ubuntu

SOLR schema currently used for this project:

      <fields>

      <field name="line" type="string" indexed="true" stored="true" required="true"/>
      <field name="coicop" type="integer" indexed="true" stored="true"/>
      <field name="EXPDESC" type="text" indexed="true" stored="true"/>
      <field name="Paid1" type="float" indexed="true" stored="true"/>
      <field name="Shop" type="text" indexed="true" stored="true"/>
      <field name="MAFFQuan" type="float" indexed="true" stored="true"/>
      <field name="MAFFUnit" type="text" indexed="true" stored="true"/>

      </fields>

addendum I: LCF Scanning Receipt Optical Character Recognition Shiny app prototype

https://github.com/ONSBigData/LCF-project/tree/master/LCFshinyReceiptOCR

A shiny app that can OCR a default receipt picture using the Tesseract OCR library or any other picture uploaded was created as a starting point for looking into getting information from receipts into a textual format so it can be processed, matched,parsed etc.

Figure 5. LCF Receipt Scanning minimal Shiny Application

In order to install the Tesseract library the following setup instructions for installing all Ubuntu requirements for this app are provided together with a link to hints/tips/suggestions on how to improve the quality of performing OCR with Tesseract .

addendum II: prototype COICOP Classification using Scikit-Learn jupyter notebook

https://github.com/ONSBigData/LCF-project/tree/master/LCF-COICOPclassification

A jupyter notebook containing 3 types of scikit learn classifiers (machine learning algorithms) trained to automatically assign a COICOP code based on a product description.

Naive Bayes
Support Vector Machines
Random Forests

Additionaly a jupyter notebook containing a Python implementation of the BM25 algorithm used in products such as Apache Lucene and SOLR

Contributors

Iva Spakulova

Theodore Manassis

Alessandra Sozzi

working for the Office for National Statistics Big Data project

LICENSE

Released under the MIT License.

onsbigdata / lcf-project Goto Github PK

lcf-project's Introduction

LCF-Project

Executive Summary

Overview

Figure 1. LCF Diary process

strand A: Using historical data to create a lookup

Figure 2. LCF flat file solution process

strand B: Using a SOLR-based indexing solution

Figure 3. Screenshot of Blaise system

Figure 4.Shiny App simulating BLAISE interface using SOLR backend to predict COICOP and propose weights

addendum I: LCF Scanning Receipt Optical Character Recognition Shiny app prototype

Figure 5. LCF Receipt Scanning minimal Shiny Application

addendum II: prototype COICOP Classification using Scikit-Learn jupyter notebook

Contributors

LICENSE

lcf-project's People

Contributors

Watchers

Forkers

Recommend Projects

Recommend Topics

Recommend Org

Jobs