GithubHelp home page GithubHelp logo

ykankaya / ptbr-wordlist Goto Github PK

View Code? Open in Web Editor NEW

This project forked from 0x00-0x00/ptbr-wordlist

0.0 2.0 0.0 144 KB

A selection of Brazilian Portuguese words that are good enough to be used as a BIP-0039 wordlist

ptbr-wordlist's Introduction

Brazilian Portuguese Wordlist

Objectives

The idea is to build a selection of Brazilian Portuguese words that are good enough to be used on computer-generated mnemonic pasphrases, such as the ones described in the BIP-0039 draft and adopted by some cryptocurrencies clients and wallets.

Attributes of a "good enough" wordlist

Following the footsteps of other languages wordlists currently listed in this BIP39 document, specially the French and Spanish ones, the selection of words for this list must consider the following features/attributes:

  • High priority on simple and common portuguese words.
  • Only words with 5-8 letters.
  • Words can be uniquely determined typing the first 4 characters (sometimes less).
  • No identical words with the Spanish, English or French wordlists

In addition to that, some other attributes like:

  • No accented words (we avoid words containing á,é,ê,ç… to keep it simple)
  • High priority to words that could be suggested by a mobile OS (words with a high frequency score on the Android dictionary, for example)

This list of attributes is still a work in progress, feedback is welcome (see the Discussion section below).

Discussion

There is an ongoing discussion about it at the bitcointalk forum, please drop by and say hello if you want to contribute :)

Build

To build all lists use:

$ npm test

Development Scripts

The package.json file have some npm-scripts that are shortcuts for shell command lines, if you have node.js installed you can run then with:

$ npm run <script_name>

The script names are listed bellow with their equivalent command lines:

getdict:bip39

Download the bip39 recommended wordlists for Spanish, French and English

Equivalent shell:

$ cd ./downloads
$ curl -O https://raw.githubusercontent.com/fczuardi/bips/master/bip-0039/english.txt
$ curl -O https://raw.githubusercontent.com/fczuardi/bips/master/bip-0039/spanish.txt 
$ curl -O https://raw.githubusercontent.com/fczuardi/bips/master/bip-0039/french.txt

getdict:android

Download the pt_BR dictionary csv from Google's Android repository

Equivalent shell:

$ mkdir -p ./downloads
$ cd ./downloads
$ curl -O https://android.googlesource.com/platform/packages/inputmethods/LatinIME/+archive/master/dictionaries.tar.gz
$ tar -zxvf dictionaries.tar.gz pt_BR* && gunzip pt_BR*.gz

csv

Write a file named lists/all.csv that is the downloaded Android dictionary filtered to display only:

  • words without any of the following letters: áéíóúâêôãõç
  • that has 5-8 letters
  • without any flags (flags=,), so, no adult or offensive words
  • sorted by the 4th column (originalFreq)

Equivalent shell:

$ cat downloads/pt_BR_wordlist.combined | \
grep -E 'word=[^áéíóúâêôãõç,]{5,8},' | \
sed -e 's:originalFreq=::' -e 's: word=::' -e 's:f=.*,::'| \
sort -t, -k2 -rn >lists/all.csv

filter:en

Write a file named lists/all-no_en.csv that is lists/all.csv without the lines containing words of downloads/english.txt

Equivalent shell:

$ cat downloads/english.txt|grep '^a'| \
sed -e 's:^\(.*\):\^\1,:' > /tmp/all_en_pattern
$ grep -vf /tmp/all_en_pattern lists/all.csv > lists/all-no_en.csv

filter:sp

Write a file named lists/all-no_en-no_sp.csv that is lists/a-no_en.csv without the lines containing words of downloads/spanish.txt

Equivalent shell:

$ cat downloads/spanish.txt|grep '^a'| \
sed -e 's:^\\(.*\\):\\^\\1,:' > /tmp/all_sp_pattern 
$ grep -vf /tmp/all_sp_pattern lists/all-no_en.csv > lists/all-no_en-no_sp.csv

filter:fr

Write a file named lists/all-no_en-no_sp-no_fr.csv that is lists/all-no_en-no_sp.csv without the lines containing words of downloads/french.txt

Equivalent shell:

$ cat downloads/french.txt|grep '^a'| \
sed -e 's:^\\(.*\\):\\^\\1,:' > /tmp/all_fr_pattern 
$ grep -vf /tmp/all_fr_pattern lists/all-no_en-no_sp.csv > lists/all-no_en-no_sp-no_fr.csv

crop10k

Write a file named lists/filtered-top-10k.csv that is the top 10,000 words in frequency from lists/all-no_en-no_sp-no_fr.csv

Equivalent shell:

$ cat lists/all-no_en-no_sp-no_fr.csv| \
head -n 10000 > lists/filtered-top-10k.csv

a-z

Writes two separated csv files for each letter from a to z, each file contains the words from that letter one sorted alphabetically (alpha-a.csv, alpha-b.csv…) and the other sorted by frequency (score-a.csv, score-b.csv…).

Equivalent shell:

$ for x in {a..z}; \
do grep "^$x" lists/filtered-top-10k.csv > lists/score-$x.csv &&\
cat lists/score-$x.csv | sort > lists/alpha-$x.csv; 
done

ptbr-wordlist's People

Contributors

fczuardi avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.