Boilerpipe

From Brede Wiki
Software (help)
Boilerpipe
Description: Extraction of content from HTML web-pages
Developer: Christian Kohlschütter
Language: Java
License: Apache License 2.0
Database(s):
Feature(s): Content extraction

boilerpipe is software for extracting the main content from HTML web pages and removing the surrounding 'clutter'. The method is described in the paper "Boilerplate detection using shallow text features". It is developed by Christian Kohlschütter.

boilerpipe is also available as a web application at:

http://boilerpipe-web.appspot.com/

Installation

wget http://boilerpipe.googlecode.com/files/boilerpipe-1.2.0-bin.tar.gz
sudo mv boilerpipe-1.2.0-bin.tar.gz /opt/
cd /opt/
sudo tar xvzf boilerpipe-1.2.0-bin.tar.gz
sudo ln -s boilerpipe-1.2.0/ boilerpipe
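
Since boilerpipe is a Java library, using the installed copy usually means putting its jar files on the Java classpath. The following is a minimal sketch of a helper for that, not part of boilerpipe itself: the function name boilerpipe_classpath is hypothetical, and it assumes the jar files sit somewhere under the /opt/boilerpipe symlink created above (the exact layout inside the archive is not checked here).

```shell
# Hypothetical helper (not shipped with boilerpipe): print a colon-separated
# classpath made of every .jar file found under the given directory.
# The default directory assumes the /opt/boilerpipe symlink from the steps above.
boilerpipe_classpath() {
    dir="${1:-/opt/boilerpipe}"
    # Sort in the C locale for a stable order, then join the lines with ':'.
    find "$dir" -name '*.jar' | LC_ALL=C sort | paste -s -d: -
}
```

The result can then be passed to the Java tools, for example java -cp "$(boilerpipe_classpath):." MyExtractor, where MyExtractor stands for your own class that calls the library.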

Related software
