Configuration

Several configuration options are available, and can be set in your project’s .ini file.

You can (and probably should) keep a separate .ini file for each project you work on. The aim is to switch between projects simply by swapping configuration files, without having to remember every detail of your settings:

python crowl.py --conf project1.ini  
python crowl.py --conf project2.ini  
...  

Here are the currently available settings:

General project settings

The only required settings are located in the [PROJECT] section, and are pretty simple:

PROJECT_NAME

The name of the project. It is used to generate timestamped names for outputs (either CSV file names or the database name).

START_URL

The start URL of your crawl (usually the homepage).
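For illustration, a minimal [PROJECT] section might look like this (the project name and URL are placeholders, not defaults):

```ini
[PROJECT]
; Used to build timestamped output names
PROJECT_NAME=myproject
; Crawl starting point, usually the homepage
START_URL=https://www.example.com/
```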

Crawler settings

The [CRAWLER] section gathers all settings regarding the way Crowl makes its requests (HTTP headers, delays, and concurrency).

USER_AGENT

Defaults to Crowl (+https://www.crowl.tech/) but you can set whatever you want here.

ROBOTS_TXT_OBEY

Should we respect the robots.txt file? Defaults to True.

DOWNLOAD_DELAY

Time to wait between requests (in seconds). Default: 0.5.

CONCURRENT_REQUESTS

Number of crawler threads running at the same time. Default: 5.

MIME_TYPES

Which MIME types should Crowl ask for (sent as the HTTP Accept header)?
Defaults to text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8.
You probably won’t need to change this setting.

ACCEPT_LANGUAGE

Which language(s) should Crowl ask for in the HTTP headers?
Default is en but you could use more complex strings, such as fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7.
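Putting the settings above together, a [CRAWLER] section using the documented defaults could look like this sketch (values shown are the defaults described above):

```ini
[CRAWLER]
USER_AGENT=Crowl (+https://www.crowl.tech/)
ROBOTS_TXT_OBEY=True
; Seconds to wait between requests
DOWNLOAD_DELAY=0.5
; Number of crawler threads running at the same time
CONCURRENT_REQUESTS=5
MIME_TYPES=text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
ACCEPT_LANGUAGE=en
```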

Data extraction settings

You’ll find the content extraction options in the [EXTRACTION] section.

LINKS

Should Crowl store links? Defaults to False to save storage space.
See this page of the docs for more details.

CONTENT

Should Crowl store HTML source code? Defaults to False to save storage space.
See this page of the docs for more details.

DEPTH

How deep should Crowl crawl your website? Defaults to 5.
Remember that Crowl crawls every page at a given depth level before moving on to the next.
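An [EXTRACTION] section with the defaults described above could look like this sketch (the LINKS key name is assumed from the setting described above):

```ini
[EXTRACTION]
; Store outgoing links? (off by default to save space)
LINKS=False
; Store raw HTML source? (off by default to save space)
CONTENT=False
; Maximum crawl depth
DEPTH=5
```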

Output modes

As of v0.2, Crowl offers multiple output modes (called pipelines), set in the [OUTPUT] section.

There are currently two pipelines available; CSV is the default.
Simply activate or deactivate a pipeline by uncommenting or commenting its line (adding a # at the beginning of the line deactivates it).

The number associated with each pipeline determines its priority: pipelines with smaller values run first if you use several pipelines at the same time.
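Assuming each pipeline line follows a `name=priority` convention, as the priority numbers described above suggest, an [OUTPUT] section enabling only the default CSV pipeline might look like:

```ini
[OUTPUT]
; Active pipeline: CSV export (priority 300 is illustrative)
crowl.CrowlCsvPipeline=300
; Commented out, so the MySQL pipeline is disabled
#crowl.CrowlMySQLPipeline=400
```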

CSV

crowl.CrowlCsvPipeline: exports data to CSV files:

  • one for urls,
  • the other for links.

See this page of the docs for additional information.

MySQL

crowl.CrowlMySQLPipeline: exports data to a MySQL database, with two tables:

  • one for urls,
  • the other for links.

For this pipeline to work, you’ll have to fill the [MYSQL] section of the file too:

  • MYSQL_HOST is your MySQL server’s host (probably localhost),
  • MYSQL_PORT is your MySQL server’s port (probably 3306),
  • MYSQL_USER is the username you use to connect to the MySQL server,
  • MYSQL_PASSWORD is the user’s password.
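A [MYSQL] section might therefore look like this sketch (host and port are the typical values mentioned above; the username and password are placeholders):

```ini
[MYSQL]
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_USER=crowl
MYSQL_PASSWORD=changeme
```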

See this page of the docs for additional information.
