dgrothe-phd / wordchecker

Word occurrence in text files, search for multiple search terms in one go, find wording inconsistencies in long texts

License: MIT License

Python 100.00%
spellcheck search-in-text typos

wordchecker's Introduction

Word Checker

Counts word occurrences in text files and saves the results in alphabetical order in a text file. Results are summarized for each source text file. Works best if the text files are saved as UTF-8 (with or without BOM), for example with Windows Notepad or Notepad++. The 3-byte BOM that can occur at the beginning of a UTF-8 file is skipped so that the first word of the file is registered as a "normal word".
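The BOM handling described above can be sketched as follows. This is only an illustration, not the code in WordOcc.py: the function name read_text_skipping_bom is hypothetical, and the sketch relies on Python's built-in "utf-8-sig" codec, which reads plain UTF-8 and drops a leading BOM if one is present.

```python
def read_text_skipping_bom(path):
    """Return the file content as text, dropping a leading UTF-8 BOM if present.

    Hypothetical helper for illustration; WordOcc.py may do this differently.
    """
    # "utf-8-sig" decodes ordinary UTF-8 and also consumes the 3-byte BOM (EF BB BF),
    # so the first word of the file does not start with an invisible character.
    with open(path, encoding="utf-8-sig") as handle:
        return handle.read()
```

The same call works for files saved without a BOM, so no separate check is needed.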

Some characters, especially commas, periods and parentheses after a word, are removed before a word is registered. This smooths the vocabulary somewhat. Punctuation inside a word (such as in "2.0") is kept, so typos such as a missing space after a comma or after a closing bracket are registered as well. URLs and compositions with slashes are not split up, so low-level domains and word endings (such as the "s" in "file(s)") are not counted as separate words.
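A minimal sketch of this kind of word normalization is shown below. It is not taken from WordOcc.py: the names normalize_word and count_words, the exact set of stripped characters and the rule of removing only unbalanced parentheses are all assumptions chosen to reproduce the behaviour described above.

```python
from collections import Counter

# Assumption: the exact set of characters stripped by WordOcc.py is not documented.
TRAILING_PUNCTUATION = ",.;:!?"


def normalize_word(token):
    """Strip punctuation around a token but keep punctuation inside it."""
    word = token.rstrip(TRAILING_PUNCTUATION)
    # Remove an opening parenthesis that has no partner inside the token, e.g. "(see".
    while word.startswith("(") and word.count("(") > word.count(")"):
        word = word[1:]
    # Remove a closing parenthesis that has no partner, e.g. "below)", but keep "file(s)".
    while word.endswith(")") and word.count(")") > word.count("("):
        word = word[:-1]
    return word.rstrip(TRAILING_PUNCTUATION)


def count_words(text):
    """Count normalized words; tokens are split on whitespace only."""
    words = (normalize_word(token) for token in text.split())
    return Counter(word for word in words if word)
```

Because tokens are split on whitespace only, "text),and" stays one entry in the result and shows up as a likely typo, while "2.0", "file(s)" and URLs remain intact.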

Usage

python WordOcc.py

At the first user prompt, enter a (relative) path; every *.txt file inside that folder is scanned. On Windows, for example, entries such as C:\Users\FooBar\texte or ./subfolder1 both work. A rudimentary programmer mode for scanning (Python) scripts is also implemented.
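Collecting the *.txt files of the entered folder could look roughly like this. The sketch uses the glob and os modules listed under Requirements; the function name list_text_files is hypothetical and the script's actual prompt handling is not shown here.

```python
import glob
import os


def list_text_files(folder):
    """Return the *.txt files directly inside the folder, sorted by name.

    Hypothetical helper; accepts relative paths such as ./subfolder1
    as well as absolute paths such as C:\\Users\\FooBar\\texte.
    """
    pattern = os.path.join(folder, "*.txt")
    return sorted(glob.glob(pattern))
```

Subfolders are not searched in this sketch; whether WordOcc.py scans them is not stated.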

Multiple-term search: checks only the occurrence of search terms defined by the user. To keep the DOS command-line input from misinterpreting non-ASCII letters, the search words cannot be entered at the command line but have to be stored in a UTF-8 text file in advance. Then run python WordOcc.py /s.
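The idea of reading search terms from a UTF-8 file instead of the command line can be sketched as follows. The file layout (one search term per line) and the names load_search_terms and count_search_terms are assumptions for illustration; the format expected by WordOcc.py is not specified here.

```python
from collections import Counter


def load_search_terms(path):
    """Read search terms from a UTF-8 file, assuming one term per line."""
    # "utf-8-sig" also accepts files that start with a BOM.
    with open(path, encoding="utf-8-sig") as handle:
        return [line.strip() for line in handle if line.strip()]


def count_search_terms(words, terms):
    """Count how often each search term occurs as a whole word."""
    wanted = set(terms)
    counts = Counter(word for word in words if word in wanted)
    return {term: counts[term] for term in terms}
```

Terms that never occur are reported with a count of 0, which makes missing terms visible in the result.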

The following command-line arguments are available.

Argument          Description
--help or /?      Display usage options
--prog or /p      Find words, commands and class objects in programming files
--refs or /r      (under construction) Gather words together with reference signs, such as cable (17)
--search or /s    Only search for user-specified search terms

Example:

python WordOcc.py --search
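One way to accept both spellings of each argument is a simple alias table over sys.argv. This is a sketch under assumptions: the names FLAG_ALIASES and parse_modes and the mode strings are hypothetical, and how WordOcc.py actually parses its arguments is not documented here.

```python
import sys

# Both spellings from the argument list map to one internal mode name (names are assumed).
FLAG_ALIASES = {
    "--help": "help", "/?": "help",
    "--prog": "prog", "/p": "prog",
    "--refs": "refs", "/r": "refs",
    "--search": "search", "/s": "search",
}


def parse_modes(arguments):
    """Translate command-line arguments into a set of mode names."""
    modes = set()
    for argument in arguments:
        if argument not in FLAG_ALIASES:
            raise ValueError(f"Unknown argument: {argument}")
        modes.add(FLAG_ALIASES[argument])
    return modes


if __name__ == "__main__":
    # sys.argv[0] is the script name, so only the remaining items are arguments.
    print(sorted(parse_modes(sys.argv[1:])))
```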

Effect

Typos and inconsistencies in the terms used in large files can be found more easily, for example inconsistent hyphen usage or similar terms, such as "disc" and "disk", used for the same object.
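As an illustration of how a word list helps with hyphen inconsistencies, the following sketch groups words that differ only in hyphenation. It is not part of WordOcc.py; the function name group_hyphen_variants is hypothetical, and comparing lowercased words is an assumption.

```python
from collections import defaultdict


def group_hyphen_variants(words):
    """Return groups of words that are identical once hyphens are removed."""
    groups = defaultdict(set)
    for word in words:
        lowered = word.lower()
        # The key ignores hyphens, so "e-mail" and "email" fall into one group.
        groups[lowered.replace("-", "")].add(lowered)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}
```

Similar but differently spelled terms such as "disc" and "disk" are not caught by this grouping; they still have to be spotted in the alphabetical list.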

Next tasks

  • Some syntax highlighting for similar words

Requirements

Written in Python 3 (works with, e.g., version 3.8.1). The following modules are imported: sys, re, glob, os

See also https://github.com/DGrothe-PhD/WordCheckerJava


Description (translated from German)

The project contains a Python script that counts words in text files. Text files stored in a folder are read by Python, and the evaluated word lists are saved. The source text files remain unchanged. If a word occurs several times, its count shows this. The count follows the respective word, separated by a tab. The word list can therefore also be processed further by opening it in a spreadsheet application such as Excel, where the result can be analyzed in more detail.
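The tab-separated output can be sketched as below. The function names format_results and write_results are hypothetical, and the sort order (plain code-point order, so uppercase letters sort before lowercase ones) is an assumption; the ordering used by WordOcc.py may differ.

```python
def format_results(counts):
    """Return one "word<TAB>count" line per word, sorted alphabetically."""
    return [f"{word}\t{count}" for word, count in sorted(counts.items())]


def write_results(counts, path):
    """Write the result lines to a UTF-8 text file that a spreadsheet can import."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(format_results(counts)) + "\n")
```

A file written this way can be opened in a spreadsheet application with the tab character as column separator.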

Objective

The aim is to enable a quick review of the words used. Inconsistencies (the same term sometimes with a hyphen and sometimes without), typos and OCR recognition errors become easier to find.

How it works

Some characters, especially commas, periods and outer parentheses, are removed before a word is registered. This smooths the vocabulary somewhat. Because characters that appear inside a word are kept, typos such as a missing space after a closing parenthesis are registered as well. URLs are also left unchanged so that, for example, low-level domains are not counted as separate word fragments.

To be able to use European languages other than English, it is best to save the text files in UTF-8 format; the well-known Windows editor (Notepad) in current Windows versions usually allows this with a single mouse click. Apostrophes and words with non-ASCII letters (étagère, ça) should then be processed correctly as well.

See also https://github.com/DGrothe-PhD/WordCheckerJava

wordchecker's Issues

Reference sign filtering

  • Declare the mode
  • Implement the algorithm
  • Highlight possible inconsistencies (different reference signs for the same word)
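The steps above could be approached roughly as in the following sketch. The feature is described as under construction, so this is not the project's algorithm: the pattern for a reference sign (digits, optionally followed by one lowercase letter, in parentheses after a word) and the names REFERENCE_PATTERN, collect_reference_signs and inconsistent_reference_signs are assumptions.

```python
import re
from collections import defaultdict

# Assumed shape: a word, whitespace, then a sign such as (17) or (17a).
REFERENCE_PATTERN = re.compile(r"\b([^\W\d_][\w-]*)\s+\((\d+[a-z]?)\)")


def collect_reference_signs(text):
    """Map each lowercased word to the set of reference signs that follow it."""
    signs = defaultdict(set)
    for word, sign in REFERENCE_PATTERN.findall(text):
        signs[word.lower()].add(sign)
    return dict(signs)


def inconsistent_reference_signs(signs):
    """Return only the words that appear with more than one reference sign."""
    return {word: sorted(found) for word, found in signs.items() if len(found) > 1}
```

For a text in which "cable (17)" and "cable (18)" both occur, the second function reports "cable" together with both signs.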

Bug: special Unicode characters

Observed

Unicode decoding bug: some characters, for example (some) emoticons, may raise an exception while a file is read.
An example error message:
  File "C:\Python\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 3953: character maps to <undefined>

Expected

Those Unicode icons are silently ignored or collected.
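The traceback names the cp1252 codec, which suggests that the file was decoded with the Windows default encoding rather than as UTF-8. A possible remedy, sketched here as an assumption and not as the project's fix, is to pass the encoding explicitly and to choose an error handler; the function name read_text_tolerant is hypothetical.

```python
def read_text_tolerant(path):
    """Read a file as UTF-8 without raising on bytes that cannot be decoded."""
    # An explicit encoding avoids the platform default (cp1252 on many Windows setups).
    # errors="replace" turns undecodable bytes into U+FFFD; errors="ignore" would drop them.
    with open(path, encoding="utf-8-sig", errors="replace") as handle:
        return handle.read()
```

With UTF-8 decoding, emoticons are read as ordinary characters and can be collected like any other word; only genuinely invalid bytes are replaced.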

Find a search term within a word

Currently, search terms are only found if the whole word equals the search term. Search terms should also be found within words, such as "card" within "cards".
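The requested behaviour could be sketched as a substring test over the counted words. This is an illustration of the idea in the issue, not an existing feature: the function name count_term_within_words is hypothetical, and the case-insensitive comparison is an assumption.

```python
def count_term_within_words(counts, term):
    """Return the words containing the term and the sum of their counts."""
    needle = term.lower()
    matches = {word: number for word, number in counts.items() if needle in word.lower()}
    return matches, sum(matches.values())
```

A plain substring test also matches unrelated words such as "cardboard" for the term "card", so listing the matching words next to the total keeps the result reviewable.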
