debaser990 / contentextraction

This project forked from feisun/contentextraction

Content Extraction via Text Density (SIGIR11)

QMake 1.45% C++ 86.61% C 11.94%

# Content Extraction via Text Density (CETD)

## Introduction

This program detects and removes the additional content (e.g., ads, navigation menus, and copyright notices) that surrounds the main content of a web page.

Before building the source code, make sure the Qt SDK is installed.
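
As a rough illustration of the text-density idea named in the project title, the sketch below computes a basic density score for a node: the number of text characters in its subtree divided by the number of tags in its subtree, with a tag count of zero treated as one. This follows the basic definition as we understand it from the cited paper and is not taken from this repository's code. The `Node` struct and the function names `charCount`, `tagCount`, and `textDensity` are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy DOM node, for illustration only.
// The real project builds its own tree from parsed HTML.
struct Node {
    std::string tag;             // element name, e.g. "div"
    std::string text;            // text directly inside this element
    std::vector<Node> children;  // child elements
};

// Total number of text characters in the subtree rooted at n.
std::size_t charCount(const Node& n) {
    std::size_t total = n.text.size();
    for (const Node& child : n.children) {
        total += charCount(child);
    }
    return total;
}

// Number of tags below n (all descendants, not counting n itself).
std::size_t tagCount(const Node& n) {
    std::size_t total = n.children.size();
    for (const Node& child : n.children) {
        total += tagCount(child);
    }
    return total;
}

// Basic text density: characters per tag in the subtree.
// Assumption: a node with no descendant tags uses a tag count of 1.
double textDensity(const Node& n) {
    std::size_t tags = tagCount(n);
    if (tags == 0) {
        tags = 1;
    }
    return static_cast<double>(charCount(n)) / static_cast<double>(tags);
}
```

Under this definition, a navigation list with many short items scores low, while a block holding a long paragraph scores high, which is the signal used to separate main content from the surrounding clutter.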

Contact: Fei Sun, Institute of Computing Technology, [email protected]
Project page: http://ofey.me/projects/cetd/

## Citation

@inproceedings{Sun:2011:DBC:2009916.2009952,
author = {Sun, Fei and Song, Dandan and Liao, Lejian},
title = {DOM based content extraction via text density},
booktitle = {Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval},
series = {SIGIR '11},
year = {2011},
isbn = {978-1-4503-0757-4},
location = {Beijing, China},
pages = {245--254},
numpages = {10},
url = {http://doi.acm.org/10.1145/2009916.2009952},
doi = {10.1145/2009916.2009952},
acmid = {2009952},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {composite text density, content extraction, densitysum, text density},
}

## License

This project is licensed under the GNU GPL version 3. The full text is available at http://www.gnu.org/licenses/gpl.txt
