barryzm / statistical-case-studies

This project forked from snowing-st/statistical-case-studies

news crawl & text analysis

Jupyter Notebook 94.66% Python 5.34%

Statistical-Case-Studies

Coursework for the Applied Statistics Case Studies course
Instructor: feng.li

Lab1

1. Read text data, preprocess it, count word frequencies, and save the frequency table to CSV
- 1.1 Read a text file
- 1.2 Do the necessary cleaning
- 1.3 Convert to another format (word count)
- 1.4 Export to CSV format
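
Steps 1.1 to 1.4 can be sketched with the standard library alone. This is only an illustration, not the notebook's actual code: the sample sentence, the function names `word_counts` and `write_counts_csv`, and the choice to keep only lowercase ASCII letters as tokens are all assumptions.

```python
import csv
import io
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text, keep runs of ASCII letters as tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


def write_counts_csv(counts, fileobj):
    """Write (word, count) rows, most frequent first, to an open text file object."""
    writer = csv.writer(fileobj)
    writer.writerow(["word", "count"])
    for word, count in counts.most_common():
        writer.writerow([word, count])


# Hypothetical sample text; in the lab this would come from open(path).read().
sample = "The cat sat. The cat ran!"
counts = word_counts(sample)

# StringIO stands in for open("word_counts.csv", "w", newline="").
buffer = io.StringIO()
write_counts_csv(counts, buffer)
csv_text = buffer.getvalue()
```

Passing a file object instead of a path keeps the export step easy to check in memory before it touches the disk.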
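2. Read the iris data, compute descriptive statistics, and do linear algebra with scipy
- 2.1 Read a CSV file
- 2.2 Compute descriptive statistics (without using np or pd)
- 2.3 Convert it to a dataframe
- 2.4 Try some linear algebra (use scipy for matrix transpose, inverse, determinant, least squares, generalized inverse, and eigenvalues and eigenvectors)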
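
The linear algebra operations listed in 2.4 might look like the following with `scipy.linalg`. The matrices here are small made-up examples rather than the iris data, and `eigh` is used only because the example matrix is symmetric; a general square matrix would need `linalg.eig` instead.

```python
import numpy as np
from scipy import linalg

# Made-up matrices for illustration; not taken from the iris data set.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])          # symmetric, invertible
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])          # design matrix with an intercept column
y = np.array([1.0, 3.0, 5.0])       # exactly y = 1 + 2 * x

A_t = A.T                                        # transpose
A_inv = linalg.inv(A)                            # inverse
A_det = linalg.det(A)                            # determinant
beta, residues, rank, sv = linalg.lstsq(X, y)    # least-squares solution
X_pinv = linalg.pinv(X)                          # Moore-Penrose generalized inverse
eigvals, eigvecs = linalg.eigh(A)                # eigenvalues and eigenvectors (symmetric A)
```

The generalized inverse gives the same least-squares coefficients as `lstsq` here, since `X_pinv @ y` solves the same minimization problem.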

Lab2

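1. Maximize a likelihood function with optimize.minimize from the scipy package
- 1.1 Generate normally distributed random numbers with np.random.normal
- 1.2 Define the log-likelihood function
- 1.3 Choose initial parameter values
- 1.4 Minimize the negative log-likelihood with scipy.optimize.minimize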
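
A minimal sketch of steps 1.1 to 1.4, assuming a normal model with unknown mean and standard deviation. The true parameter values, the seed, the sample size, the BFGS method, and the log-sigma parametrization (used so the optimizer never tries a negative standard deviation) are illustrative choices, not details from the notebook.

```python
import numpy as np
from scipy.optimize import minimize

# 1.1 Simulate data from a normal distribution (parameters chosen for illustration).
np.random.seed(0)
data = np.random.normal(loc=2.0, scale=1.5, size=2000)


# 1.2 Negative log-likelihood of N(mu, sigma^2), parametrized by (mu, log sigma).
def neg_log_likelihood(params, x):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    n = x.size
    return (0.5 * n * np.log(2 * np.pi)
            + n * log_sigma
            + np.sum((x - mu) ** 2) / (2 * sigma ** 2))


# 1.3 Initial values: mu = 0, sigma = exp(0) = 1.
start = np.array([0.0, 0.0])

# 1.4 Maximizing the likelihood is the same as minimizing its negative log.
result = minimize(neg_log_likelihood, start, args=(data,), method="BFGS")
mu_hat = result.x[0]
sigma_hat = np.exp(result.x[1])
```

For the normal model the estimates can be checked against the closed-form answers: `mu_hat` should be close to `data.mean()` and `sigma_hat` close to `data.std()` (the version that divides by n).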
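2. Chinese text processing
- 2.1 Read the 2018 Government Work Report txt file
- 2.2 Remove empty lines that contain only \n to obtain paragraphs
- 2.3 Split on full stops and exclamation marks, and remove \n and spaces, to obtain sentences
- 2.4 Use Python regular expressions (re.sub) to remove punctuation, digits, and English letters
- 2.5 Write line by line to txt and csv files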
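
Steps 2.2 to 2.4 could be sketched as below. The input string is a short made-up stand-in for the report file, and treating "text" as the CJK Unified Ideographs range `\u4e00-\u9fff` is an assumption about what the cleaning step should keep.

```python
import re

# Hypothetical stand-in for the report; the lab reads the text from a txt file.
raw = "各位代表：\n\n  过去五年，成就显著！GDP增长6.9%。\n\n\n今后要继续努力。\n"

# 2.2 Drop empty lines so that each remaining line is one paragraph.
paragraphs = [line for line in raw.split("\n") if line.strip()]

# 2.3 Split each paragraph on the Chinese full stop and exclamation mark,
#     then remove any remaining whitespace inside the pieces.
sentences = []
for paragraph in paragraphs:
    for piece in re.split(r"[。！]", paragraph):
        piece = re.sub(r"\s+", "", piece)
        if piece:
            sentences.append(piece)

# 2.4 Remove punctuation, digits, and Latin letters by keeping only
#     CJK ideographs (assumed definition of the text to keep).
cleaned = [re.sub(r"[^\u4e00-\u9fff]", "", s) for s in sentences]
cleaned = [s for s in cleaned if s]
```

Keeping a whitelist of characters is usually shorter than listing every punctuation mark to delete, at the cost of also dropping any non-CJK text that might have been wanted.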

Lab3

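English text processing with nltk and re

Crawl news URLs in the Business - Finance category of Xinhuanet (requests.get + json.loads)
For each URL, crawl the news title and content (xpath)
Batch-read the news text files (txt)
Preprocess the text (nltk + re.sub)
Build the term-frequency matrix (sklearn.feature_extraction.text.CountVectorizer)
Draw a word cloud from the word frequencies (wordcloud)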

Lab4

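Search for and save news about the "China-US trade war" published at different times, and extract keywords with jieba

Batch-read the files
Extract all text data
Load a custom dictionary (jieba.load_userdict)
Segment the text into words (jieba.cut)
Add custom words (jieba.add_word)
Remove letters, digits, punctuation, and stop words
Extract keywords (jieba.analyse.extract_tags)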

Lab5

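Sina news search scrapy spider

TradeWar.py: crawls the full content of the news items in the search results list
TradeWarList.py: crawls the news summaries in the search results list
Modified middlewares.py/items.py/settings.py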

Lab6

TradeWarCrawl.py: a rewrite of Lab5 that crawls several different search keywords in parallel
TradeWarCrawl - class.py: rewritten as a class, but it cannot crawl news in parallel

Lab7

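Crawl news about the ZTE incident from Sina search and extract news viewpoints by time period
preprocess.py: preprocesses the crawled data, extracts word frequencies, and draws a word cloud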

Lab8

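Build an N-gram model and a word2vec model on the crawled news text

Probabilistic-Language-Modeling.ipynb
Use the N-gram model to extract bigram frequencies and classify news by time period
Use word2vec for a simple exploration of semantic similarity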
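
The bigram-frequency step can be illustrated with the standard library. This is a sketch under stated assumptions, not the notebook's code: the tokenized snippets, the period labels, and the helper names `bigram_counts` and `conditional_probability` are all hypothetical, and real input would first need word segmentation.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs in one tokenized document."""
    return Counter(zip(tokens, tokens[1:]))


def conditional_probability(counts, first, second):
    """Maximum-likelihood estimate of P(second | first) from bigram counts."""
    prefix_total = sum(c for (a, _), c in counts.items() if a == first)
    return counts[(first, second)] / prefix_total if prefix_total else 0.0


# Hypothetical pre-tokenized news snippets grouped by period (labels are made up).
docs_by_period = {
    "period_1": [["trade", "war", "talks", "begin"], ["trade", "war", "escalates"]],
    "period_2": [["trade", "talks", "resume"]],
}

# Aggregate bigram counts separately for each period.
bigrams_by_period = {}
for period, docs in docs_by_period.items():
    total = Counter()
    for tokens in docs:
        total.update(bigram_counts(tokens))
    bigrams_by_period[period] = total

top_bigram = bigrams_by_period["period_1"].most_common(1)[0]
```

Counting bigrams per document and then summing avoids creating a spurious pair across the boundary between two documents.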

Lab9

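Build a dynamic topic model on the crawled news text

Dynamic Topic Models.ipynb
Dynamic topic model: gensim.models.ldaseqmodel
Extract the keywords for each period
Dynamic topic model visualization: pyLDAvis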

Final Course Report

LDA-based extraction and quantification of skill requirements in job postings: a case study of data analysis internships on Shixiseng

------------------- Completed 2018.07.06 ----------------------
