
q-m / food-ingredient-parser-ruby


Extract the structure of ingredient lists on food products

License: MIT License

Ruby 100.00%
ingredients food-additives parser food-products structured-data ruby-gem ruby ingredient-lists treetop

food-ingredient-parser-ruby's Introduction

Food ingredient parser

Ingredients are listed on food products in various ways. This Ruby gem and program parses the ingredient text and returns a structured representation.

Installation

gem install food_ingredient_parser

This will also install the dependency treetop. If you want colored output for the test program, also install pry: gem install pry.

Example

require 'food_ingredient_parser'

s = "Water* 60%, suiker 30%, voedingszuren: citroenzuur, appelzuur, zuurteregelaar: E576/E577, " \
    + "natuurlijke citroen-limoen aroma's 0,2%, zoetstof: steviolglycosiden, * = Biologisch. " \
    + "E = door de E.U. goedgekeurde toevoeging."
parser = FoodIngredientParser::Strict::Parser.new
puts parser.parse(s).to_h.inspect

Results in

{
  :contains=>[
    {:name=>"Water", :amount=>"60%", :marks=>["*"]},
    {:name=>"suiker", :amount=>"30%"},
    {:name=>"voedingszuren", :contains=>[
      {:name=>"citroenzuur"}
    ]},
    {:name=>"appelzuur"},
    {:name=>"zuurteregelaar", :contains=>[
      {:name=>"E576"},
      {:name=>"E577"}
    ]},
    {:name=>"natuurlijke citroen-limoen aroma's", :amount=>"0,2%"},
    {:name=>"zoetstof", :contains=>[
      {:name=>"steviolglycosiden"}
    ]}
  ],
  :notes=>[
    "* = Biologisch",
    "E = door de E.U. goedgekeurde toevoeging"
  ]
}
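
The nested hash can be walked with a small recursive helper. The sketch below is only an illustration: ingredient_names is a hypothetical helper, not part of the gem, and the result hash is an abbreviated hand-typed copy of the output above rather than something produced by running the parser here.

```ruby
# Recursively collect [depth, name] pairs from a hash shaped like the
# to_h output shown above. Hypothetical helper, not part of the gem.
def ingredient_names(node, depth = 0)
  (node[:contains] || []).flat_map do |child|
    entry = child[:name] ? [[depth, child[:name]]] : []
    entry + ingredient_names(child, depth + 1)
  end
end

# Abbreviated, hand-typed copy of the result above.
result = {
  contains: [
    { name: "Water", amount: "60%", marks: ["*"] },
    { name: "suiker", amount: "30%" },
    { name: "zuurteregelaar", contains: [{ name: "E576" }, { name: "E577" }] }
  ],
  notes: ["* = Biologisch"]
}

# Print every ingredient, indented by nesting depth.
ingredient_names(result).each { |depth, name| puts "#{'  ' * depth}#{name}" }
```

Keeping the depth next to each name makes it easy to tell top-level ingredients from those nested under a compound ingredient.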

Test tool

The executable food_ingredient_parser is available after installing the gem. If you're running this from the source tree, use bin/food_ingredient_parser instead.

$ food_ingredient_parser -h
Usage: bin/food_ingredient_parser [options] --file|-f <filename>
       bin/food_ingredient_parser [options] --string|-s <ingredients>

    -f, --file FILE                  Parse all lines of the file as ingredient lists.
    -s, --string INGREDIENTS         Parse specified ingredient list.
    -q, --[no-]quiet                 Only show summary.
    -p, --parsed                     Only show lines that were successfully parsed.
    -n, --noresult                   Only show lines that had no result.
    -r, --parser PARSER              Use specific parser (strict, loose).
    -e, --[no-]escape                Escape newlines
    -c, --[no-]color                 Use color
        --[no-]html                  Print as HTML with parsing markup
    -v, --[no-]verbose               Show more data (parsed tree).
        --version                    Show program version.
    -h, --help                       Show this help

$ food_ingredient_parser -v -s "tomato"
"tomato"
RootNode+Root3 offset=0, "tomato" (contains,notes):
  SyntaxNode offset=0, ""
  SyntaxNode offset=0, ""
  SyntaxNode offset=0, ""
  ListNode+List13 offset=0, "tomato" (contains):
    SyntaxNode+List12 offset=0, "tomato" (ingredient):
      SyntaxNode+Ingredient0 offset=0, "tomato":
        SyntaxNode offset=0, ""
        IngredientNode+IngredientSimpleWithAmount3 offset=0, "tomato" (ing):
          IngredientNode+IngredientSimple5 offset=0, "tomato" (name):
            SyntaxNode+IngredientSimple4 offset=0, "tomato" (word):
              SyntaxNode offset=0, "tomato":
                SyntaxNode offset=0, "t"
                SyntaxNode offset=1, "o"
                SyntaxNode offset=2, "m"
                SyntaxNode offset=3, "a"
                SyntaxNode offset=4, "t"
                SyntaxNode offset=5, "o"
              SyntaxNode offset=6, ""
        SyntaxNode offset=6, ""
      SyntaxNode offset=6, ""
  SyntaxNode+Root2 offset=6, "":
    SyntaxNode offset=6, ""
    SyntaxNode offset=6, ""
    SyntaxNode offset=6, ""
  SyntaxNode offset=6, ""
{:contains=>[{:name=>"tomato"}]}

$ food_ingredient_parser --html -s "tomato"
<div class="root"><span class='depth0'><span class='name'>tomato</span></span></div>

$ food_ingredient_parser -v -r loose -s "tomato"
"tomato"
Node interval=0..5
  Node interval=0..5, name="tomato"
{:contains=>[{:name=>"tomato"}]}

$ food_ingredient_parser -q -f data/test-cases
parsed 35 (100.0%), no result 0 (0.0%)

If you want to use the output in (shell) scripts, the options -e and -c may be useful.

to_html

When ingredient lists are entered manually, it can be very useful to show how the text is recognized. This can help in understanding why a certain ingredient list cannot be parsed.

For this you can use the to_html method on the parsed output, which returns the original text, augmented with CSS classes for different parts.

require 'food_ingredient_parser'

parsed = FoodIngredientParser::Strict::Parser.new.parse("Saus (10% tomaat*, zout). * = bio")
puts parsed.to_html
<span class='depth0'>
  <span class='name'>Saus</span> (
  <span class='contains depth1'>
    <span class='amount'>10%</span> <span class='name'>tomaat</span><span class='mark'>*</span>,
    <span class='name'>zout</span>
  </span>)
</span>.
<span class='note'>* = bio</span>

For an example of an interactive editor, see examples/editor.rb.

Loose parser

The strict parser only parses ingredient lists that conform to one of the many different formats expected. If you'd like to always get a result, even if it is not necessarily completely correct, you can use the loose parser. It does not use Treetop, but looks at the input character by character and tries to make the best of it. The result may be imperfect, but if you just want to have some result, this can still be very useful.

require 'food_ingredient_parser'

parsed = FoodIngredientParser::Loose::Parser.new.parse("Saus [10% tomaat*, (zout); peper.")
puts parsed.to_h

Even though the strict parser would not give a result, the loose parser returns:

{
  :contains=>[
    {:name=>"Saus", :contains=>[
      {:name=>"tomaat", :marks=>["*"], :amount=>"10%", :contains=>[
        {:name=>"zout"}
      ]},
      {:name=>"peper"}
    ]}
  ]
}
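
A common way to combine the two parsers is to try the strict one first and fall back to the loose one when it returns nil. The sketch below shows that pattern. parse_with_fallback is a hypothetical helper, not part of the gem, and it accepts any objects responding to parse, so the block itself does not load the gem.

```ruby
# Try each parser in order and return the first non-nil result.
# Hypothetical helper: parsers only need to respond to #parse and
# return nil when they cannot parse the text.
def parse_with_fallback(text, parsers)
  parsers.each do |parser|
    parsed = parser.parse(text)
    return parsed if parsed
  end
  nil
end

# With the gem installed, this could be called as follows (not run here):
#   require 'food_ingredient_parser'
#   parsers = [FoodIngredientParser::Strict::Parser.new,
#              FoodIngredientParser::Loose::Parser.new]
#   parsed = parse_with_fallback("Saus [10% tomaat*, (zout); peper.", parsers)
#   puts parsed.to_h
```

Ordering the parsers from strict to loose keeps the more reliable result whenever it is available.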

Compatibility

From the 1.0.0 release, the main interface will be stable. This comprises the two parsers' parse methods (incl. documented options), their nil result when parsing fails, and the parsed output's to_h and to_html methods. Please note that parsed node trees may be subject to change, even within a major release. Within a minor release, node trees are expected to remain stable.

So if you only use the stable interface (parse, to_h and to_html), you can lock your version to e.g. ~> 1.0. If you depend on more, lock your version against e.g. ~> 1.0.0 and test when you upgrade to 1.1.
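
To see what the two constraints allow, RubyGems' own Gem::Requirement class can be queried directly. This is only an illustration of the pessimistic version operator; the version numbers tried below are arbitrary examples.

```ruby
require 'rubygems'

stable_api = Gem::Requirement.new('~> 1.0')   # >= 1.0 and < 2.0
node_trees = Gem::Requirement.new('~> 1.0.0') # >= 1.0.0 and < 1.1.0

# Check a few arbitrary example versions against both constraints.
%w[1.0.3 1.1.1 2.0.0].each do |v|
  version = Gem::Version.new(v)
  puts "#{v}: '~> 1.0' #{stable_api.satisfied_by?(version)}, " \
       "'~> 1.0.0' #{node_trees.satisfied_by?(version)}"
end
```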

Languages

While most of the parsing is language-independent, some parts need knowledge about certain words (like abbreviations and amount specifiers). The gem was developed with ingredient lists in Dutch (nl), plus a bit of English and German. Support for other languages is already good but lacking in certain areas; improvements are welcome (starting with a corpus in data/).

Many ingredient lists from the USA are structured a bit differently from those from Europe, so they parse less well (that is probably a matter of fine-tuning).

Test data

data/ingredient-samples-qm-nl contains about 150k real-world ingredient lists found on the Dutch market. Each line contains one ingredient list (newlines are encoded as \n, empty lines and those starting with # are ignored). The strict parser currently parses 80%, while the loose parser returns something for all of them.
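
The line format described above can be read with a few lines of Ruby. This is a minimal sketch under the stated format only. read_ingredient_lists is a hypothetical helper, and how the gem's own test tool decodes these files is not shown here.

```ruby
# Turn the raw text of a sample file into an array of ingredient lists.
# Assumes the format described above: one list per line, newlines inside a
# list written as a literal backslash followed by "n", and empty lines or
# lines starting with "#" skipped. Hypothetical helper, not part of the gem.
def read_ingredient_lists(raw)
  raw.each_line.map(&:chomp)
     .reject { |line| line.strip.empty? || line.start_with?('#') }
     .map { |line| line.gsub('\n', "\n") } # '\n' is backslash + n here
end

# Made-up sample input with a comment, an empty line and an encoded newline.
sample = "# comment\nwater, suiker\n\nSaus:\\ntomaat, zout\n"
p read_ingredient_lists(sample)
```

For a real file, the raw text could come from File.read with the path of the sample file.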

License

This software is distributed under the MIT license. Data may have a different license.

food-ingredient-parser-ruby's People

Contributors

wvengen

food-ingredient-parser-ruby's Issues

Amount dropped with mark in loose parser

$ bin/food_ingredient_parser --version
food_ingredient_parser v1.1.1
$ bin/food_ingredient_parser -r loose -s "foo 50%"
foo 50%
{:contains=>[{:name=>"foo", :amount=>"50%"}]}
$ bin/food_ingredient_parser -r loose -s "foo* 50%"
foo* 50%
{:contains=>[{:name=>"foo", :marks=>["*"]}]}

Ingredients with note containing amount not parsed

The ingredient declaration

Pork, Paprika, Salt, Maize Dextrose, Maize Dextrin, Garlic, Paprika Oil, Stabiliser: Pentasodium Triphosphate, Antioxidant: Sodium Erythorbate, Preservative: Sodium Nitrite, Filled into natural Pork casings, Prepared with 120g of Pork per 100g of product

is not parsed at all by the strict parser.

Handle separator in "stabilisatoren: e407-e412-e415"

Sometimes dash is separator: stabilisatoren: e407-e412-e415 (but not always: kleurstof: paprika-extract). Handle this separator.

Happens in about 0.2% of ingredient lists.
Run grep -i ':\s*e[0-9]\+-e[0-9]\+' data/ingredient-samples-nl for examples.
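
One possible heuristic, sketched here as plain Ruby rather than as a change to the grammar, is to treat a dash as a separator only when every part of the token looks like an E-number. split_e_numbers and its pattern are assumptions for illustration, not the project's solution.

```ruby
# Split a dash-joined run of E-numbers such as "e407-e412-e415" and leave
# any other token (e.g. "paprika-extract") untouched.
# Hypothetical helper; the pattern (3-4 digits, optional letter suffix)
# is an assumption about what E-numbers look like in these lists.
def split_e_numbers(token)
  e_number_run = /\Ae\d{3,4}[a-z]?(?:-e\d{3,4}[a-z]?)+\z/i
  token.match?(e_number_run) ? token.split('-') : [token]
end

p split_e_numbers("e407-e412-e415")
p split_e_numbers("paprika-extract")
```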

Compare with PulseFoodInnovation

In 2022 Data in Brief had an article about parsing ingredients, including the data of product ingredient lists from various countries. It contains barcodes, ingredient listings and parsed ingredients.

It would be useful to compare the two parsers of this project, and see how they compare to the dataset.

part after bracketed text with 'and' dropped

In this ingredient declaration:

Wheat Flour [with Calcium, Iron, Niacin (B3) and Thiamin (B1)] and Wholemeal Wheat Flour, Water, Yeast, Vegetable Oils (Sunflower, Rapeseed and Sustainable Palm in varying proportions), Salt, Wheat Gluten, Malted Barley Flour, Emulsifiers: E471, E472e, Soya Flour, Preservative: Calcium Propionate, Flavouring, Flour Treatment Agent: Ascorbic Acid (Vitamin C)

currently "Wholemeal Wheat Flour" is completely dropped from the parsed output.
A solution is probably to recognize "and" here as a separator.

Fix colon with bracketed amount

The ingredients list sauce: (50%) tomato, salt is incorrectly parsed by both the strict and loose parser.

strict: no result
loose: {:contains=>[{:name=>"sauce", :amount=>"50%"}, {:name=>"salt"}]}

This would be good to fix.

Handle semicolon after colon

An ingredients list like "Schokolade (Süßungsmittel: Maltit; Kakaobutter, Kakaomasse)" contains mixed separators (; and ,). Here the semicolon is used to indicate the end of the second-level nesting for Maltit.

Unhandled mixed brackets

49% varkensvlees (Beter Leven keurmerk 1 ster), 23% satésaus (water, 22% PINDAKAAS, suiker, gemodificeerd maïszetmeel, SOJASAUS [water, SOJABONEN, TARWE, zout], azijn, zonnebloemolie, aroma's, zout, specerijen [paprikapoeder, knoflook, komijn], verdikkingsmiddelen [E412/ E415], sambal [rode peper, zout], conserveermiddelen [E202/ E211], zuurteregelaar [E575]), paneermeel (TARWEBLOEM, kurkuma, gist, zout, kleurstoffen [E100/ E160b]), raapolie, water, zetmeel (TARWE, maïs), TARWEBLOEM, varkensvet (Beter Leven keurmerk 1 ster), zout (zeezout, zout), koriander, gemberpoeder, gistextract, natuurlijk aroma, voedingszuur (E330) , zuurteregelaar (E262).

Handle chemical names better

Some chemical names have numbers in them. These are either not recognized by the parser or parsed wrongly.

  • 1-hydroxyethyl 4,5-diamino pyrazole sulfate
  • 1,2-benzisothiazol-3
  • 2,3-trimethyl-2-isopropylbutanamide
  • 2-bromo-2-nitropropane-1,3-diol
  • 2,3-Dihydro-1,1-dimethyl-1H-indene-ar-propanal
  • 2-Amino-4-Hydroxyethylaminoanisole Sulfate
  • 3-benzodioxole-5-propionaldehyde
  • dinatrium-5-guanylate
  • dinatrium-5'- ribonucleotiden
  • dinatrium-5’-ribonucleotiden
  • dinatrium uridine 5'-monofosfaat

Handle weight and percentage both being specified

Sometimes both a percentage and a weight are specified. Maybe we need to support both:

Double chocolate cookiemix 400 g (89,9%): TARWEbloem [GLUTEN], suiker, palmolie, zonnebloemolie, glucosestroop, magere cacaopoeder 7,2%, KIPHEELEIpoeder (scharrel), MELKeiwit, GERSTemoutbloem [GLUTEN], zout, rijsmiddelen: E500-E450, aroma. Witte chocolade chunks 45 g (10,1%): suiker, volle MELKpoeder, cacaoboter, emulgator: SOJAlecithine, aroma.

Colon in ingredient name

foo: (bar) returns a first ingredient name with colon for the loose parser, while it should be omitted.

Coloned amount not parsed correctly

With the loose parser, a, <1%: b yields:

{:amount=>"<1%", :contains=>[{:name=>"a"}]}

though it should yield

{:contains=>[{:name=>"a", :contains=>[{:name=>"b", :amount=>"<1%"}]}]}

'and' not parsed in coloned list context

This ingredients list is not parsed by the strict parser:

saus (83%): tomaten*, uien*, wortelen*, koudgeperste zonnebloemolie*, ongeraffineerd zeezout, basilicum* (0,3% van het eindproduct), oregano*. pasta (10,3%): tarwegriesmeel*, verse eieren* (20% van de pasta), water. vulling : aubergine*, paneermeel*, tomaten*, basilicum*, olijfolie* en ongeraffineerd zeezout. * Van biologische herkomst

or, more targeted:

vulling: olie* en zout*

even though olie* en zout* is parsed. So with a coloned ingredient the 'and' separator is not recognized.

Handle **ingredient**

Sometimes ingredients are surrounded by double asterisks. This probably marks an allergen (see also #4). The strict parser doesn't currently handle this (or recognizes it as the start of notes), and the loose parser recognizes the first ** as a mark and includes the second ** in the resulting name.

This happens in 0.15% of the ingredient lists.

Handle multiple marks

A small number of products has multiple mark symbols. Currently only one is supported.

  • Rename mark to marks and make it an array (for future API compatibility)
  • Fully recognize and handle multiple marks

An example is Dr. Bronner's Shikakai soap teatree with ingredients:

INCI: Vitis Vinifera Juice*, Sucrose*^, Cocos Nucifera Oil* (***), Potassium Hydroxide, Olea Europaea* (***), Melaleuca Alternifolia*, Accacia Concinna (Shikakai) Nut Powder*, Citric Acid, Cannabis Sativa Seed Oil*, Buxus Chinensis (Jojoba) Seed Oil*, Tocopherols (vitaminw E), d-Limonene***. * Van biologische herkomst *** Etherische olie ^ Volgens fairtrade-normen verhandeld

and

Water, groentenª¹ 26,3% (broccoli 17,2%, erwt 2,5%, prei, ui 2,6%, spinazie), aardappelª¹, ROOMª 5,9%, maïszetmeelª, raapzaadolieª, zout, rietsuikerª, gistextract, nootmuskaatª, aroma, ª afkomstig van gecontroleerde biologische landbouw., ¹op duurzame wijze geteeld.

strip "ingredient list"

We currently strip "ingredients" but not "ingredients list" in front of the ingredient declaration. E.g.

INGREDIENTS LIST: Wheat Flour [Wheat Flour, Calcium Carbonate, Iron, Niacin, Thiamin), Water, Sunflower Seed (8%), Rye Flour (5%), Yeast, Rye (2%), Toasted Rye Flakes (2%), Wheat Gluten, Salt, Barley Malt Flour, Emulsifiers (Mono- and Diacetyl Tartaric Acid Esters of Mono- and Diglycerides of Fatty Acids, Mono- and Diglycerides of Fatty Acids), Malted Rye Flour, Preservative (Calcium Propionate), Rapeseed Oil, Flour Treatment Agent (Ascorbic Acid).
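
As an illustration, such a prefix could be removed in a pre-processing step before parsing. The sketch below is not how the gem strips "ingredients" internally (that is not shown in this document). strip_ingredients_prefix is a hypothetical helper that only covers the English wording.

```ruby
# Remove a leading "ingredients:" or "ingredients list:" label, ignoring case.
# Hypothetical pre-processing helper; other languages are not handled.
def strip_ingredients_prefix(text)
  text.sub(/\A\s*ingredients?(?:\s+list)?\s*:\s*/i, '')
end

puts strip_ingredients_prefix("INGREDIENTS LIST: Wheat Flour, Water, Yeast")
```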

Better note detection

Detection of ingredient notes could be improved (especially with the loose parser, I'd think). There are frequently occurring ingredient notes that can be recognized, maybe even before parsing (as pre-processing).

A list of Dutch ingredient notes: ingredient_notes.xlsx

Improve handling of 'and'

There is some rudimentary support in the strict parser, but it has some issues (some valid ingredient lists with 'and' are not parsed).

Handle amount after nesting colon

The following ingredients list doesn't parse with the strict parser. It thinks 36% belongs to Wraphapje mozzarella-tomaat instead of tomatenwrap:

Wraphapje mozzarella-tomaat: 36% tomatenwrap , 29% half zongedroogde tomaat (27% tomaat, zonnebloemolie, knoflook, zout, oregano, marjolein, peterselie), 20% mozzarella , 15% groene pesto . , Wraphapje geitenkaas-beenham: 41% geitenkaas , 33% wrap , 22% beenham , 4% honing. Allergie-informatie: bevat tarwe (gluten), lactose, melkeiwit, ei, cashewnoot, geitenmelkeiwit. Gemaakt in een bedrijf waar ook pinda's en andere noten worden verwerkt.

Mark in front confuses loose parser

With listed ingredients *Kappertjes (58%), *wijnazijn (21%), water, zout, *uit de biologische landbouw., the loose parser returns the full ingredients list as first ingredient.

Add loose parser

Since only three quarters of all ingredient lists can be parsed, it would be useful to have a less strict parser that tries to make the best of what it can parse.

  • Parse nested ingredients with brackets.
  • Parse nested ingredients with colon.
  • Parse marks.
  • Separate amounts from names.
  • Parse amounts within brackets.
  • Separate notes from ingredients.
  • Fix note parsing.
  • Handle 'and' correctly (needed or not?)

Detection of allergens

There are various ways in which allergens are codified. It may be useful to add this to the structured data output.

Some variations seen:

  • ..., volle {melk}, ...
  • ..., volle MELK, ...
  • ... . Kan melk bevatten. (various forms like sporen van, etc.)
  • ..., kan melk bevatten. (ibid)
  • ..., botersaus (bevat melk), ...
  • ..., VISsaus (VIS), ...
  • ..., KAAS (17%) (EDAMMER (kleurstof: bèta-caroteen, MOZZARELLA), water, ...
  • ..., <b>melk</b>, ...
  • ..., **roomboter** 74%, ...

Different forms can also be intermixed.
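
Some of the explicit markings above can be picked up with regular expressions. The following is a rough sketch, not a proposal for the gem's output format. marked_allergens is a hypothetical helper, it ignores the "kan ... bevatten" and "bevat ..." phrasings, and the uppercase rule will also flag any other all-caps word.

```ruby
# Return the words in an ingredient text that carry one of the explicit
# markings listed above: {braces}, <b>tags</b>, **asterisks**, or a run of
# at least three uppercase letters. Hypothetical helper, heuristic only.
def marked_allergens(text)
  patterns = [
    /\{([^}]+)\}/,       # {melk}
    /<b>([^<]+)<\/b>/i,  # <b>melk</b>
    /\*\*([^*]+)\*\*/,   # **roomboter**
    /(\p{Lu}{3,})/       # MELK, also the VIS in VISsaus
  ]
  patterns.flat_map { |re| text.scan(re).flatten }.map(&:downcase).uniq
end

p marked_allergens("volle {melk}, VISsaus (VIS), **roomboter** 74%, <b>ei</b>")
```

Because different forms can be intermixed, all patterns are applied to the same text and duplicates are removed afterwards.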

Make gem

so that it can be used more easily in other projects.
