dumpmemory / 2c

This project forked from howie6879/liuli

Build a multi-source (WeChat official accounts, RSS), clean, personalized reading environment

License: Apache License 2.0

Python 83.61% Shell 0.97% Jupyter Notebook 5.45% HTML 9.32% Dockerfile 0.64%

2c's Introduction

Liuli

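📖 Build a multi-source, clean, personalized reading environment

琉璃开净界,薜荔启禅关 (roughly: "Lapis lazuli opens the pure realm; climbing fig unlocks the gate of Zen")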

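✨ Features

With Liuli, you can get (all of these are goals for now):

  • Configuration-driven development with customizable input, processing, and output
  • Information backup (cross-source supported): GitHub, MongoDB
  • Machine-learning powered: CAPTCHA recognition, ad classification, smart tagging
  • Reading-source management for building a knowledge management platform
  • Technical support for the official examples

Use cases: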

🍥 Usage

Tutorials [read before use]:

Quick start (make sure Docker is installed first):

mkdir liuli && cd liuli
# Database directory
mkdir mongodb_data
# Configure pro.env; see doc/02.环境变量.md for details
vim pro.env
# Download the docker-compose file
wget https://raw.githubusercontent.com/howie6879/liuli/main/docker-compose.yaml
# Start the services
docker-compose up -d

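To install and run from source:

# Make sure a Python 3.7+ environment is available
git clone https://github.com/howie6879/liuli.git
cd liuli

# Create the base environment
pipenv install --python={your_python3.7+_path}  --skip-lock --dev
# Configure .env (see doc/02.环境变量.md for details), then start the scheduler
pipenv run dev_schedule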

A successful start produces logs like the following:

Loading .env environment variables...
[2021:12:23 23:08:35] INFO  Liuli Schedule started successfully :)
[2021:12:23 23:08:35] INFO  Liuli Schedule time: 00:00 06:00
[2021:12:23 23:09:36] INFO  Liuli playwright 匹配公众号 老胡的储物柜(howie_locker) 成功! 正在提取最新文章: 我的周刊(第018期)
[2021:12:23 23:09:39] INFO  Liuli 公众号文章持久化成功! 👉 老胡的储物柜
[2021:12:23 23:09:40] INFO  Liuli 🤗 微信公众号文章更新完毕(1/1)

The pushed result looks like this:

🤔 Implementation

The rough workflow is as follows:

liuli_process

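A brief explanation:

  • Collector: monitors the custom reading sources you follow, such as WeChat official accounts or blog feeds, and builds a feed stream that serves as the input;
  • Processor: applies custom processing to the target content, for example an ad classifier trained on historical ad data with machine learning (rules can be customized), or automatic tagging;
  • Distributor: relies on the API layer for data requests and responses, offers users personalized configuration, and then distributes automatically according to that configuration, sending the clean articles to WeChat, DingTalk, Telegram, or even a self-hosted website.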
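
The three stages above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not Liuli's actual code: the names Article, collect, process, and distribute are hypothetical, and a simple keyword rule stands in for the machine-learning ad classifier so that the example stays self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Article:
    """Hypothetical record for one feed item (not Liuli's real schema)."""
    source: str
    title: str
    url: str
    tags: List[str] = field(default_factory=list)
    is_ad: bool = False


def collect(feeds: Dict[str, List[dict]]) -> List[Article]:
    """Collector: merge items from every reading source into one feed."""
    return [
        Article(source=name, title=item["title"], url=item["url"])
        for name, items in feeds.items()
        for item in items
    ]


# Assumed keyword rule standing in for the ML ad classifier.
AD_KEYWORDS = ("discount", "coupon", "sponsored")


def process(articles: Iterable[Article]) -> List[Article]:
    """Processor: flag likely ads and attach a simple tag."""
    processed = []
    for article in articles:
        lowered = article.title.lower()
        article.is_ad = any(word in lowered for word in AD_KEYWORDS)
        article.tags.append("ad" if article.is_ad else "clean")
        processed.append(article)
    return processed


def distribute(
    articles: Iterable[Article],
    senders: Dict[str, Callable[[Article], None]],
) -> int:
    """Distributor: send every non-ad article to each configured channel.

    Returns the number of deliveries made.
    """
    sent = 0
    for article in articles:
        if article.is_ad:
            continue
        for send in senders.values():
            send(article)
            sent += 1
    return sent


if __name__ == "__main__":
    feeds = {
        "wechat": [{"title": "My weekly, issue 18", "url": "https://example.com/a"}],
        "rss": [{"title": "Sponsored: big discount", "url": "https://example.com/b"}],
    }
    outbox: List[str] = []
    senders = {"console": lambda article: outbox.append(article.title)}
    count = distribute(process(collect(feeds)), senders)
    print(count, outbox)
```

Each sender is just a callable here, so a WeChat, DingTalk, or Telegram channel would plug in as another entry in the senders mapping.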

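This is how the clean reading environment gets built. Going a step further, the collected data opens up many other possibilities, so feel free to think broadly about what else could be done with it.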

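🤖 Help

To improve the model's recognition accuracy, I hope everyone can contribute ad samples where possible. See the sample file .files/datasets/ads.csv, which uses the following format:

title             url              is_process
Ad article title  Ad article link  0

Field descriptions:

  • title: the article title
  • url: the article link; for WeChat articles, please first verify that the link has not expired
  • is_process: whether the sample has been processed; just fill in 0 by default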
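
As a rough illustration of the format above, the sketch below parses rows into a small record type and rejects malformed ones. It assumes the file is comma-separated with a title,url,is_process header and that is_process only takes the values 0 or 1; neither detail is confirmed by the text above. AdSample and parse_ad_samples are hypothetical names, not part of Liuli.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class AdSample:
    """Hypothetical in-memory form of one ads.csv row."""
    title: str
    url: str
    is_process: int


def parse_ad_samples(text: str) -> List[AdSample]:
    """Parse CSV text with a title,url,is_process header into AdSample rows.

    Assumes comma-separated values and is_process in {0, 1}.
    """
    reader = csv.DictReader(io.StringIO(text))
    samples = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        title = (row.get("title") or "").strip()
        url = (row.get("url") or "").strip()
        flag = (row.get("is_process") or "").strip()
        if not title or not url:
            raise ValueError(f"line {line_no}: title and url are required")
        if flag not in ("0", "1"):
            raise ValueError(f"line {line_no}: is_process must be 0 or 1, got {flag!r}")
        samples.append(AdSample(title=title, url=url, is_process=int(flag)))
    return samples


if __name__ == "__main__":
    demo = "title,url,is_process\nAd article title,https://example.com/ad,0\n"
    for sample in parse_ad_samples(demo):
        print(sample)
```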

An example:

liuli_ads_csv_demo

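The same ad is usually placed in several official accounts, so before adding a row please check whether the record already exists. I hope everyone will pitch in, so please open a PR and contribute!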
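
The duplicate check requested above could be automated along these lines before opening a PR. This is only a sketch: it assumes a comma-separated file with a title,url,is_process header, and it treats a matching title or a matching url as a duplicate, which is my reading of "the same ad placed in several accounts" rather than a rule stated by the project. The function names are hypothetical.

```python
import csv
import io
from typing import Set, Tuple


def existing_keys(csv_text: str) -> Tuple[Set[str], Set[str]]:
    """Collect the titles and URLs already present in the CSV text."""
    titles: Set[str] = set()
    urls: Set[str] = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        titles.add((row.get("title") or "").strip())
        urls.add((row.get("url") or "").strip())
    return titles, urls


def is_duplicate(csv_text: str, title: str, url: str) -> bool:
    """Return True if the title or the url already appears in the CSV text.

    Assumption: either field matching counts as the same ad.
    """
    titles, urls = existing_keys(csv_text)
    return title.strip() in titles or url.strip() in urls


if __name__ == "__main__":
    current = "title,url,is_process\nAd article title,https://example.com/ad,0\n"
    print(is_duplicate(current, "Ad article title", "https://example.com/other"))
    print(is_duplicate(current, "A new ad", "https://example.com/new"))
```

In practice the csv_text argument would come from reading .files/datasets/ads.csv, for example with pathlib.Path(...).read_text(encoding="utf-8").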

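👀 Acknowledgements

Thanks to the following open-source projects:

  • Flask: web framework
  • Ruia: asynchronous crawler framework
  • playwright: browser-based data scraping
  • CharCNN: thanks to the authors of the CharCNN paper, Xiang Zhang, Junbo Zhao, and Yann LeCun

Every PR is a great support to the Liuli project. Many thanks to the following developers for their contributions (in no particular order):

👉 About

Feel free to join the discussion (follow the account to join the group):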

img

2c's People

Contributors

aixiaofour, baboon-king, howie6879, leslieleung, ruiruizhou, zyd16888
