Storm-crawler – the scalable spider

Storm-crawler is an open source SDK for building distributed web crawlers with Apache Storm. The project is licensed under the Apache License v2 and consists of a collection of reusable resources and components, written mostly in Java.

The aim of Storm-crawler is to help build web crawlers that are:

  • scalable
  • resilient
  • low latency
  • easy to extend
  • polite yet efficient

Storm-crawler is a library and collection of resources that developers can leverage to build their own crawlers. The good news is that doing so can be pretty straightforward. Often, all you’ll have to do is declare storm-crawler as a Maven dependency, write your own Topology class (tip: you can extend ConfigurableTopology), reuse the components provided by the project, and maybe write a couple of custom ones for your own secret sauce. A bit of tweaking to the configuration and off you go!
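Declaring the dependency looks roughly like the sketch below. Treat the coordinates and version as placeholders: they differ between releases (check the project’s README for the ones matching your Storm version), so verify them before copying.

```xml
<!-- Sketch only: group ID, artifact ID and version vary by release -->
<dependency>
  <groupId>com.digitalpebble.stormcrawler</groupId>
  <artifactId>storm-crawler-core</artifactId>
  <version>${stormcrawler.version}</version>
</dependency>
```

With the core module on the classpath, your Topology class can wire together the provided spouts and bolts with Storm’s usual TopologyBuilder.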

Apart from the core components, we provide some external resources that you can reuse in your project, such as our spouts and bolts for Elasticsearch, or a ParserBolt which uses Apache Tika to parse various document formats.

Storm-crawler is perfectly suited to use cases where the URLs to fetch and parse come as streams, but it is also an appropriate solution for large-scale recursive crawls, particularly where low latency is required. The project is used in production by several companies and is actively developed and maintained.

The Presentations page contains links to some recent presentations made about this project.
