Several configuration options are available and can be set in your project’s .ini file.

You can (and probably should) keep a separate .ini file for each project you’re working on. The aim is to switch easily between projects just by swapping configuration files, without having to remember every detail of your settings:

python crowl.py --conf project1.ini
python crowl.py --conf project2.ini

Here are the currently available settings:

General project settings

The only required settings are located in the [PROJECT] section, and are pretty simple:


The name of the project. It will be used to generate timestamped names for outputs (either CSV file names or the database name).


The start URL of your crawl (usually the homepage).
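Together, those two settings make for a very short section. In the sketch below, the key names (NAME, START_URL) are illustrative; check your config.ini template for the exact ones:

```ini
[PROJECT]
# Illustrative key names -- check your config.ini template.
NAME = myproject
START_URL = https://www.example.com/
```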

Crawler settings

The [CRAWLER] section gathers all the settings that control how Crowl makes its requests (HTTP headers, more or less).


Defaults to Crowl (+…), but you can set whatever you want here.


Should we respect the robots.txt file? Defaults to True.


Regex pattern to identify URLs that should not be crawled.
If left empty, any available URL will be crawled (if allowed by ROBOTS_TXT_OBEY). This is the default setting.

Otherwise, any URL matching the regex will be ignored. Links to this URL will still be stored, though.

Example: use \/tag\/ if you want to ignore any URL containing a /tag/ pattern.
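As a sanity check, you can test your exclusion pattern with plain Python regexes before launching a crawl. The variable names below are illustrative; the pattern is the \/tag\/ example from above:

```python
import re

# The exclusion pattern from the example above:
# ignore any URL containing /tag/.
exclusion_pattern = re.compile(r"\/tag\/")

urls = [
    "https://www.example.com/tag/news/",
    "https://www.example.com/blog/my-post/",
]

# A URL matching the pattern would be skipped by the crawler,
# although links pointing to it would still be recorded.
crawlable = [u for u in urls if not exclusion_pattern.search(u)]
print(crawlable)  # only the /blog/ URL remains
```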


Time to wait between requests (in seconds). Default: 0.5.


Number of crawler threads running at the same time. Default: 5.


Which MIME types should Crowl ask for?
Defaults to text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8.
You probably won’t need to change this setting.


Which language(s) should Crowl ask for in the HTTP headers?
Default is en but you could use more complex strings, such as fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7.
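Taken together, a [CRAWLER] section using the defaults described above might look like this sketch. Apart from ROBOTS_TXT_OBEY, the key names are illustrative; check your config.ini template for the exact ones:

```ini
[CRAWLER]
# Key names other than ROBOTS_TXT_OBEY are illustrative.
USER_AGENT = Crowl
ROBOTS_TXT_OBEY = True
# Leave empty to crawl every allowed URL, or set a regex to exclude URLs.
EXCLUSION_PATTERN =
DOWNLOAD_DELAY = 0.5
CONCURRENT_REQUESTS = 5
ACCEPT = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
ACCEPT_LANGUAGE = en
```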

Data extraction settings

You’ll find the content extraction options in the [EXTRACTION] section.

Should Crowl store links? Defaults to False to save storage space.
See this page of the docs for more details.


Should Crowl store HTML source code? Defaults to False to save storage space.
See this page of the docs for more details.


How deep should Crowl crawl your website? Defaults to 5.
Remember that Crowl crawls every page at a given depth level before moving on to the next one.
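An [EXTRACTION] section keeping every documented default could look like this sketch (the key names are illustrative; check your config.ini template):

```ini
[EXTRACTION]
# Illustrative key names; the values shown are the documented defaults.
LINKS = False
HTML = False
DEPTH = 5
```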

Output modes

As of v0.2, Crowl offers multiple output modes (called pipelines), set in the [OUTPUT] section.

There are currently two pipelines available; CSV is the default one.
You can deactivate a pipeline by commenting out its line (adding a # at the beginning of the line), and reactivate it by removing the #.

The number associated with each pipeline determines its priority: pipelines with smaller priority values run first if you happen to use multiple pipelines at the same time.
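For illustration, an [OUTPUT] section with the CSV pipeline active and the MySQL pipeline disabled might look like the sketch below. The exact key/value syntax and the priority numbers are assumptions; check your config.ini template:

```ini
[OUTPUT]
# Lower priority values run first; comment a line with # to disable a pipeline.
crowl.CrowlCsvPipeline: 100
# crowl.CrowlMySQLPipeline: 200
```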


crowl.CrowlCsvPipeline: exports data to CSV files:

  • one for URLs,
  • the other for links.

See this page of the docs for additional information.


crowl.CrowlMySQLPipeline: exports data to a MySQL database, with two tables:

  • one for URLs,
  • the other for links.

For this pipeline to work, you’ll have to fill the [MYSQL] section of the file too:

  • MYSQL_HOST is your MySQL server’s host (probably localhost),
  • MYSQL_PORT is your MySQL server’s port (probably 3306),
  • MYSQL_USER is the username you use to connect to the MySQL server,
  • MYSQL_PASSWORD is the user’s password.
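Putting the four settings together, the section could read as follows (the values shown are placeholders for a typical local setup):

```ini
[MYSQL]
MYSQL_HOST = localhost
MYSQL_PORT = 3306
MYSQL_USER = crowl
MYSQL_PASSWORD = your_password_here
```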

See this page of the docs for additional information.
