dario-github / notion-nlp

Read the text from a Notion database and perform NLP analysis.

License: MIT License

Languages: Python 74.05%, Shell 2.50%, Batchfile 23.45%
Topics: flomo, nlp, notion, text-analysis, text-summarization, tf-idf, notion-api, notion-database, python

notion-nlp's Introduction

Notion Rich Text Data Analysis

Notion NLP

Read text from a Notion database and perform natural language processing (NLP) analysis.

Badges: Tests Passing · codecov · visitors

English / 简体中文

Introduction

To achieve functionality similar to flomo, I have created a database using Notion, where I have recorded my thoughts and insights over the years, accumulating a rich corpus of language. However, the random roaming feature of flomo did not meet my needs, so I decided to develop a small tool that integrates with the Notion API and performs NLP analysis.

Now, the tool can:

  • Output intuitive and visually appealing word cloud images.

  • Generate thematic summaries of your Notion notes.

    📝 Example thematic summary

  • Support multiple languages. I have added stopword lists for several languages including Chinese, English, Russian, French, Japanese, and German. Users can also customize their own stopword lists.

    🌏 Stopword lists for multiple languages

  • Support multiple tasks. Users can configure multiple databases and corresponding filtering and sorting conditions to create rich analysis tasks.

    🔍 Example configuration file

    For example, I have added the following tasks:

    • 🤔 Reflections from the past year
    • 🚩 Optimization of annual summaries for the current year
    • ⚠️ Self-admonitions from all time periods

I am pleased to share this tool and hope it can be helpful to you. 😆

Pipeline

flowchart TB
A1[(Get personal access token from note-taking software)]
B2[Customize NLP module parameters]
B3[Customize visualization module parameters]
B1[Configure API key and corresponding database ID]
C1((Run task))
D1([Read rich text via API]) 
D2([Segment/Clean/Build word-sentence mapping])
D3[/Calculate TF-IDF/]
E1{{Markdown of keywords and source sentences}}
E2{{Word cloud with multiple color styles}}

  A1 -- Configure_task_parameters --> Parameter_types
  subgraph Parameter_types
  B1
  B2
  B3
  end
  B1 & B2 & B3 --> C1
  C1 --> Calculation_module
  subgraph Calculation_module
  D1 --> D2 --> D3
  end
  Calculation_module --> Visualization_module
  subgraph Visualization_module
  E1
  E2
  end
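
The calculation module above amounts to segmenting each piece of rich text, cleaning it, and scoring words with TF-IDF. The snippet below is a minimal sketch of that idea using jieba and scikit-learn; it is not the project's actual implementation, and the sample sentences and stopwords are invented for illustration.

    # Minimal sketch of the segment -> clean -> TF-IDF steps (not the project's real code).
    import jieba
    from sklearn.feature_extraction.text import TfidfVectorizer

    sentences = ["今天复盘了季度目标的完成情况", "Reviewed my quarterly goals today"]  # stand-ins for Notion rich text
    stopwords = {"了", "的", "my", "today"}                                           # would come from the stopword files

    def segment_and_clean(text: str) -> str:
        # jieba handles Chinese; English tokens pass through unchanged.
        tokens = (t.strip() for t in jieba.lcut(text))
        return " ".join(t for t in tokens if t and t.lower() not in stopwords)

    corpus = [segment_and_clean(s) for s in sentences]
    vectorizer = TfidfVectorizer()            # TF-IDF over the cleaned, space-joined tokens
    tfidf = vectorizer.fit_transform(corpus)  # rows: sentences, columns: vocabulary terms

    # Top keywords per sentence: roughly what feeds the markdown and word-cloud outputs.
    terms = vectorizer.get_feature_names_out()
    for row, sentence in zip(tfidf.toarray(), sentences):
        top = sorted(zip(terms, row), key=lambda pair: -pair[1])[:3]
        print(sentence, "->", [word for word, score in top if score > 0])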

Installation and Usage

  • Windows System

    Download the latest release of the Windows version zip file, extract it, double-click start.bat, and follow the script prompts to get started.

  • Linux System

    • Method 1: Download the latest release of the Linux version zip file, extract it, open the terminal in the current directory, and enter ./notion-nlp-linux --help to view the command details.

    • Method 2: Install the package from PyPI into your Python environment (a quick import check is sketched below).

      pip install notion-nlp
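
A quick way to confirm the install worked is to import the two entry points used throughout the rest of this README; this is only a sanity check, not a required step.

    # Optional sanity check after `pip install notion-nlp`.
    from notion_nlp import run_all_tasks, run_task
    print("notion-nlp imported:", run_all_tasks.__name__, run_task.__name__)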

Configure Tasks

  • Use config.sample.yaml as the configuration file reference (hereinafter config; rename it to config.yaml to use it as your own configuration file).

Get the integration token

  • Create a new integration on the Notion integrations page, get your own token, and then fill the token into the config.yaml file.

    Graphic Tutorial: tango / markdown

Add the integration to the database / get the database ID

  • If you open the Notion database page in your browser, or click Share and copy the link, you will see the database ID in the URL (it looks like a long jumble of characters). Fill it in as database_id under the corresponding task in config.

    Graphic Tutorial: tango / markdown

Configure the extra parameter to filter and sort database entries
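
The README does not spell the extra format out here; it corresponds to the filter and sorts parameters of Notion's database query API. The dict below is only an illustration of the shape such a block takes, with made-up property names and values, so check config.sample.yaml for the exact key names and nesting the tool expects.

    # Illustration only: the general shape of a Notion database-query payload that an
    # "extra" block would describe. The property names and values here are invented.
    extra = {
        "filter": {
            "property": "Tags",                          # a multi-select property in the example database
            "multi_select": {"contains": "insight"},
        },
        "sorts": [
            {"timestamp": "created_time", "direction": "descending"},
        ],
    }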

Run all tasks

  • Select "Run all tasks" in the Windows interactive script

  • After installing the package from PyPI, call it from the terminal or from Python code

    • Run from command line

      python3.8 -m notion_nlp run-all-tasks --config-file /path/to/your/config/file
    • Run from Python code

      from notion_nlp import run_all_tasks
      config_file = "./notion-nlp-dataset/configs/config.yaml"
      run_all_tasks(config_file)

Run a single task

  • Select "Run specified task" in the Windows interactive script

  • After installing the package from PyPI, call it from the terminal or from Python code

    • In the run_task command, you can specify the task in several ways, including:

      • task: an instance of TaskParams;
      • task_json: a JSON string representing the task information;
      • task_name: the name of the task.
    • If config_file exists, you can use task_name to specify the task. Note that the task needs to be activated, otherwise an exception will be thrown. If config_file does not exist, you need to provide a token and either TaskParams or task_json.

      • With an existing config file, pass in task name/task json/task parameter class

        • Run from command line

          # Option 1
          python3.8 -m notion_nlp run-task --task-name task_1 --config-file /path/to/your/config/file
          
          # Option 2
          python3.8 -m notion_nlp run-task --task-json '{"name": "task_1", "database_id": "your_database_id"}' --config-file /path/to/your/config/file
        • Run from Python code

          from notion_nlp import run_task
          task_name = "task_1"
          database_id = "your_database_id"
          config_file="./notion-nlp-dataset/configs/config.yaml"
          
          # Option 1
          run_task(task_name=task_name, config_file=config_file)
          
          # Option 2 (not recommended for Python code)
          import json
          task_info = {"name": task_name, "database_id": database_id}
          run_task(task_json=json.dumps(task_info, ensure_ascii=False), config_file=config_file)
          
          # Option 3 (recommended)
          from notion_nlp.parameter.config import TaskParams
          task = TaskParams(name=task_name, database_id=database_id)
          run_task(task=task, config_file=config_file)
      • Without a config file, pass in token and task json/task parameter class

        • Run from command line

          # Option 1
          python3.8 -m notion_nlp run-task --task-json '{"name": "task_1", "database_id": "your_database_id"}' --token 'your_notion_integration_token'
        • Run from Python code

          from notion_nlp import run_task
          task_name = "task_1"
          database_id = "your_database_id"
          notion_token = "your_notion_integration_token"
          
          # Option 1 (not recommended for Python code)
          import json
          task_info = {"name": task_name, "database_id": database_id}
          run_task(task_json=json.dumps(task_info, ensure_ascii=False), token=notion_token)
          
          # Option 2 (recommended)
          from notion_nlp.parameter.config import TaskParams
          task = TaskParams(name=task_name, database_id=database_id)
          run_task(task=task, token=notion_token)

Enhance Personal Experience

🛃 Custom Stopword List

  • Add a text file with the suffix stopwords.txt (for example custom.stopwords.txt) to the stopwords directory, with one stopword per line (a small sketch follows below).
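
As a small illustration, the snippet below creates such a file; the location of the stopwords directory is assumed here and should match wherever your installation keeps its stopword files.

    # Write a custom stopword list, one word per line (the path is an assumption for illustration).
    from pathlib import Path

    words = ["umm", "basically", "stuff"]    # whatever noise words you want filtered out
    stop_file = Path("stopwords") / "custom.stopwords.txt"
    stop_file.parent.mkdir(parents=True, exist_ok=True)
    stop_file.write_text("\n".join(words) + "\n", encoding="utf-8")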

📝 Share Your Ideas with the Author

💝 Buy the author a cup of coffee and request a personalized customization. Donate PayPal

Development

  • You are welcome to fork the project to add new features or fix bugs.

  • After cloning the project, use the create_python_env_in_new_machine.sh script to create a Poetry virtual environment.

  • After completing the code development, use the invoke command to perform a series of formatting tasks, including black/isort tasks added in task.py.

    invoke check
  • After submitting the formatted changes, run unit tests to check coverage.

    poetry run tox
    

Note

  • The word segmentation tool has two built-in options: jieba and pkuseg. (I am considering adding language detection so the most suitable segmenter is picked automatically for each language.)

    • jieba is used by default.
    • pkuseg cannot be installed with poetry and has to be installed manually with pip. The library is also slow and memory-hungry; in testing, a VPS with less than 1 GB of RAM had to use swap to run it. A generic sketch of switching between the two segmenters follows this list.
  • The analysis method using tf-idf is too simple. Consider integrating an LLM API (such as OpenAI GPT-3) for deeper analysis.
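
For reference, the two segmenters expose very similar call sites, so switching between them is mostly a one-line change. The sketch below is generic and not the project's internal code.

    # Generic sketch of choosing a segmenter (not the project's internal code).
    def build_segmenter(name: str = "jieba"):
        if name == "jieba":
            import jieba
            return jieba.lcut                # fast, installed by default
        if name == "pkuseg":
            import pkuseg                    # must be pip-installed separately; heavier on memory
            return pkuseg.pkuseg().cut
        raise ValueError(f"unknown segmenter: {name}")

    segment = build_segmenter("jieba")
    print(segment("从Notion数据库读取文本并进行NLP分析"))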

Contributions

License and Copyright

  • MIT License
    1. The MIT License is a permissive open-source software license. This means that anyone is free to use, copy, modify, and distribute your software, as long as they include the original copyright notice and license in their derivative works.

    2. However, the MIT License comes with no warranty or liability, meaning that you cannot be held liable for any damages or losses arising from the use or distribution of your software.

    3. By using this software, you agree to the terms and conditions of the MIT License.

Contact information

notion-nlp's People

Contributors

dario-github


Forkers

nouth-wang xchyyb

notion-nlp's Issues

Build relation subgraphs from documents

Perform relation extraction on the text: for each word, extract the other words it is associated with, forming a doc-word relation graph in which docs are linked to each other through shared words and the words within the same doc are linked to one another. By computing a graph embedding we obtain a word-vector matrix that can be used to mine the relation subgraph of a given word.
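
One simple way to prototype this is a bipartite doc-word graph plus a crude embedding. The sketch below uses networkx and a truncated SVD of the adjacency matrix as a stand-in for a real graph-embedding method, and the documents and words in it are toy data.

    # Toy sketch of the doc-word relation graph idea (invented data, simplistic embedding).
    import networkx as nx
    import numpy as np

    docs = {
        "doc1": ["notion", "api", "token"],
        "doc2": ["api", "tfidf", "keyword"],
        "doc3": ["keyword", "wordcloud"],
    }

    G = nx.Graph()
    for doc, words in docs.items():
        for w in words:
            G.add_edge(doc, w)               # each doc is linked to its words
        for i, a in enumerate(words):        # words within the same doc are linked to each other
            for b in words[i + 1:]:
                G.add_edge(a, b)

    nodes = list(G.nodes)
    A = nx.to_numpy_array(G, nodelist=nodes)             # adjacency matrix of the doc-word graph

    # Crude "graph embedding": keep the top singular vectors of the adjacency matrix.
    U, S, _ = np.linalg.svd(A)
    embedding = dict(zip(nodes, (U[:, :2] * S[:2]).round(2).tolist()))

    # The relation subgraph of a word: everything within two hops of it.
    word = "api"
    neighborhood = nx.single_source_shortest_path_length(G, word, cutoff=2)
    print("embedding of", word, "->", embedding[word])
    print("2-hop relation subgraph of", word, "->", sorted(G.subgraph(neighborhood).nodes))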

README improvements

  • Convert the markdown to an image, crop it to 450*800, and put it in the README

Fix the CI configuration bug

pytest passes, yet the job ultimately returns 1; investigate why

___________________________________ summary ____________________________________
ERROR:  py38: InterpreterNotFound: python3.8
  py39: commands succeeded
  py310: commands succeeded
Error: Process completed with exit code 1.

Look up how to configure tox so that it only runs the environments that are actually available

Improve the README

  • How to add a Notion integration to a specific page

Improve invocation support

Using the command line is still not very convenient, so improve the Python library interface

  • Rename the project
  • Add a main-function entry point
  • Add a parameter-file API

Add multilingual stopwords

Supported Languages

Available languages
Arabic
Bulgarian
Catalan
Czech
Danish
Dutch
English
Finnish
French
German
Gujarati
Hindi
Hebrew
Hungarian
Indonesian
Malaysian
Italian
Norwegian
Polish
Portuguese
Romanian
Russian
Slovak
Spanish
Swedish
Turkish
Ukrainian
Vietnamese
Persian/Farsi

Contribution

Thanks to https://github.com/Alir3z4/stop-words

💡 Idea collection

📝 Update or write documentation

  • Shrink the images
    • Make them adapt to different platforms' page layouts
  • Sponsorship link
  • #38
  • Fix the paths of resource-file hyperlinks
  • #39
    • Copy the database link and ask New Bing about it
    • Just say that the Notion API is needed and describe your own requirements

Refer #31 #35 (comment)

  • Scrape Hu Xijin's Weibo posts and analyze them
  • #40
  • Scrape Jay Chou's lyrics and analyze them, vs. JJ Lin
  • How the scores of Hu Xijin's annual topics change along the timeline (vertical bar-chart GIF)
  • How the scores of Jay Chou's lyric topics change along the timeline

🐛 Fix bugs

✨ Add new features or enhance existing features

Add support for more application APIs

Online

  • notion
  • #41
  • Wolai
  • Yuque (语雀)
  • Evernote
  • Day One (journaling app)
  • Roam Research
  • Milanote

Offline

  • Obsidian
  • Craft (Mac)
  • Bear (熊掌记)

Enhance the NLP module

  • Re-evaluate the choice of word segmentation tool
  • Topic modeling
  • #42
  • Visualization
    • Word cloud
      • Call an online pretrained-model API, compute the cosine distance between the top-n keywords and black/white, and set whichever is closer as the word cloud's background color
    • Topic cluster plot
    • Sentiment analysis plot
  • jieba's search-engine mode (mine long words as much as possible and list all combinations)
  • Tag alignment: cluster semantically similar tags
  • Personal feature: hook into the OpenAI API and fine-tune my own model, turning the e-books in my Reading List, the articles waiting to be read, and my Readwise highlights into a Q&A-style reading assistant that condenses unread articles and books and, based on my current knowledge, lists the content worth catching up on from simple to difficult, to improve efficiency
    https://twitter.com/XDash
    Screenshot

🎨 Update the UI or visuals

  • Create a local parameter file and output test content
  • Add token, database ID, and other parameters via command-line interaction
  • Windows support
  • Bash script
  • Email subscription
  • Quick deployment
  • Users may want to compare text data from different sources or time periods to see the similarities, differences, or changes between them.
  • Users may want to customize some parameters or options, such as choosing different languages, topics, sentiments, etc.
  • Users may want to export or share the text-analysis results, e.g. saving them as PDF or Word, or sending them to someone else.
  • #43
  • Add a crawler feature group: scraped social-media data goes straight into a Notion database, gets tagged, task parameters are generated, and the analysis then just reads the data from Notion

🚀 Performance improvements

  • #30
  • Switch the download source based on IP address, so downloads inside China come from a stable mirror

✅ Add or update tests

🔥 Remove code or files

🎉 Major milestones or releases

  • v1.0.5.1
  • v1.0.6
  • v1.1.0

✨ Enhance the NLP module

  • Re-evaluate the choice of word segmentation tool
  • Topic modeling
  • Sentiment analysis
  • Visualization
    • Word cloud
      • Call an online pretrained-model API, compute the cosine distance between the top-n keywords and black/white, and set whichever is closer as the word cloud's background color
    • Topic cluster plot
    • Sentiment analysis plot
  • jieba's search-engine mode (mine long words as much as possible and list all combinations)
  • Tag alignment: cluster semantically similar tags
  • Personal feature: hook into the OpenAI API and fine-tune my own model, turning the e-books in my Reading List, the articles waiting to be read, and my Readwise highlights into a Q&A-style reading assistant that condenses unread articles and books and, based on my current knowledge, lists the content worth catching up on from simple to difficult, to improve efficiency
    https://twitter.com/XDash
    Screenshot

Fix the unzip_webfile bug

src/notion_nlp/core/task.py:195: in run_all_tasks
    run_task(task=task, token=config.notion.token)
src/notion_nlp/core/task.py:156: in run_task
    stopwords = load_stopwords(stopfiles_dir, stopfiles_postfix, download_stopwords)
src/notion_nlp/parameter/utils.py:37: in load_stopwords
    unzip_webfile(params.multilingual_stopwords_url, stopfiles_dir)
src/notion_nlp/parameter/utils.py:124: in unzip_webfile
    with urllib.request.urlopen(url) as response:

✨ Improve usability

  • Create a local parameter file and output test content
  • Add token, database ID, and other parameters via command-line interaction
  • Windows support
  • Bash script
  • Email subscription
  • Quick deployment
  • Users may want to compare text data from different sources or time periods to see the similarities, differences, or changes between them.
  • Users may want to customize some parameters or options, such as choosing different languages, topics, sentiments, etc.
  • Users may want to export or share the text-analysis results, e.g. saving them as PDF or Word, or sending them to someone else.
  • Support adding tasks through the interactive script

📝 Flesh out the README

  • Shrink the images
    • Make them adapt to different platforms' page layouts
  • Sponsorship link
  • Windows support
  • Fix the paths of resource-file hyperlinks
  • Add a tango tutorial for the extra configuration (ask an LLM, or configure it yourself)
    • Copy the database link and ask New Bing about it
    • Just say that the Notion API is needed and describe your own requirements

Refer #31
Choose suitable corpora:

Pick corpora that users are interested in or familiar with, such as social media, news, or blogs, so that users can see what comes out of analyzing their own or other people's text. You can also pick corpora that are fun or challenging, such as poetry, lyrics, or novels, so that users can discover unexpected or creative content.

  • Scrape Hu Xijin's Weibo posts and analyze them
  • Scrape Elon Musk's tweets and analyze them
  • Scrape Jay Chou's lyrics and analyze them, vs. JJ Lin

Use suitable presentation methods:

Use appealing, easy-to-understand ways to show what the tool can do, such as charts, images, word clouds, and other visualizations, so users can see the text-analysis results at a glance. You can also use interactive or gamified approaches that let users take part in the analysis, e.g. entering text they want analyzed, choosing different parameters or options, or comparing text data from different sources or time periods.

  • How the scores of Hu Xijin's annual topics change along the timeline (vertical bar-chart GIF)
  • How the scores of Jay Chou's lyric topics change along the timeline

🎉 v1.0.6 improvement plan

🚀 Performance

🎨 Interface

  • Add an option to hide the logs
  • Add an overall progress bar across tasks
  • After a run finishes, give clearer pointers to the result file paths (echo from the script)
  • Add a log-text class so logs switch output language automatically with the system language (never mind, English is enough; who cares about Chinese logs?)

✨ Features

  • When scoring, drop all zero values before averaging (a one-line sketch follows this list)
  • Change today's lucky event to a heart, money, or a star
  • Store the tfidf results in per-task folders as well
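
A one-line illustration of the zero-dropping average mentioned in the list above (the numbers are made up):

    scores = [0.0, 0.42, 0.0, 0.31, 0.27]                       # made-up per-item scores
    nonzero = [s for s in scores if s > 0]
    average = sum(nonzero) / len(nonzero) if nonzero else 0.0   # ~0.333 instead of 0.2 with zeros kept
    print(average)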

🐛 Fixes

  • Fix English grammar mistakes in the log messages
  • Remove the duplicated timestamp from the messages
  • Fix the results path not being printed in the log (the last log line on first-try)

📝 Docs

  • Remove the extra =

🔥 Files

  • Remove redundant files

When called from a notebook, the output files end up under the package installation (site-packages) directory

[2023-02-22 10:36:51.791] [INFO] [1273672] [nlp.py] [274] [en_unit_testing_task result markdown have been saved to /root/.cache/pypoetry/virtualenvs/jupyter-env-tZaIskfJ-py3.8/lib/python3.8/results]
[2023-02-22 10:37:23.105] [INFO] [1273672] [nlp.py] [281] [word cloud plot saved to /root/.cache/pypoetry/virtualenvs/jupyter-env-tZaIskfJ-py3.8/lib/python3.8/results/word_cloud]

The logic for determining the output directory needs to change so that it is not tied to the directory where the code files live.
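
A minimal sketch of the direction such a fix could take, assuming the results root should come from the working directory or an explicit argument rather than from __file__; the function and parameter names here are illustrative, not the project's actual API.

    # Illustration only: resolve the results directory independently of the installed package.
    from pathlib import Path
    from typing import Optional

    def resolve_results_dir(output_dir: Optional[str] = None) -> Path:
        # The problematic pattern is Path(__file__).parent / "results", which lands inside
        # site-packages when the library is imported from a notebook.
        root = Path(output_dir) if output_dir else Path.cwd()
        results = root / "results"
        results.mkdir(parents=True, exist_ok=True)
        return results

    print(resolve_results_dir())             # e.g. <current working directory>/results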

Add Colab support

Todo

  • Provide a minimal Colab notebook
  • Provide a complete, end-to-end demo
