A few more tips
You might find that Python can be very useful on a daily basis. Learn a few tips in this post.
Crowl runs on Python 3 or above. It works best on UNIX-like systems (Linux and macOS), but will run on Windows too.
To check which version of Python your system is running, open a terminal and execute the following:
python --version
You should get something like Python 3.*. If that’s the case, you can now install Crowl.
If the output isn’t something like Python 3.*, try this:
python3 --version
If this didn’t work either, please download and install the latest Python 3 version.
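The same check can also be made from inside Python itself, which is handy when you are not sure which interpreter a given command actually starts. The snippet below is only a minimal sketch: the helper name is_python3 is made up for this illustration and is not part of Crowl.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple is Python 3 or above.

    is_python3 is a hypothetical helper written for this example;
    Crowl does not ship it.
    """
    return tuple(version_info)[0] >= 3


if __name__ == "__main__":
    # Show the version of the interpreter running this script.
    print("Python", ".".join(str(part) for part in sys.version_info[:3]))
    print("OK for Crowl" if is_python3() else "Python 3 is required")
```

Save it to a file and run it with both python and python3 to see which command points to a Python 3 interpreter.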
If you do have python3 installed but not as the default python interpreter, here are your options:
We recommend using virtual environments to keep each project’s dependencies separate and avoid conflicts.
You can, for instance, use pyenv.
Once pyenv and its pyenv-virtualenv plugin are installed, you’ll be able to quickly create environments:
mkdir crowltech
cd crowltech
pyenv virtualenv 3.6.4 crowltech
pyenv local crowltech
python --version
python3 as the default interpreter
You can replace Python 2 as the default Python interpreter on your system by using aliases.
On UNIX-like systems (Linux and macOS), edit your ~/.bash_profile file and add the following:
alias python=python3
alias pip=pip3
Save the changes, then run:
source ~/.bash_profile
python3
We don’t recommend it, but if you don’t want to (or can’t) change your default Python interpreter, you can simply replace the python and pip commands with python3 and pip3 respectively.
With git (recommended)
We recommend using git, as it makes upgrading a lot easier.
Simply clone the repository:
git clone https://gitlab.com/crowltech/crowl.git
cd crowl
Without git
If you’re not comfortable using git, you can download a zip archive or a tar.gz archive directly.
In console:
wget https://gitlab.com/crowltech/crowl/-/archive/master/crowl-master.tar.gz
tar -xzvf crowl-master.tar.gz
mv crowl-master crowl
cd crowl
Once inside the crowl directory, install the dependencies using pip:
pip install -r requirements.txt
This will download and install all Python dependencies.
You are now ready to start crawling.
Crowl can try to determine the language of the content on each crawled page using the fastText language identification model.
To use this feature, you need to activate it in your config file and download the model into the data folder:
mkdir data
cd data
wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
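Before activating the feature, it can help to confirm that the model really landed where the commands above put it. The following is a minimal sketch that only assumes the layout shown above (a data folder inside the Crowl directory containing lid.176.bin); the function name language_model_path is hypothetical and not part of Crowl.

```python
from pathlib import Path

# File name taken from the download URL above.
MODEL_FILENAME = "lid.176.bin"


def language_model_path(crowl_dir="."):
    """Return the path to the downloaded model, or None if it is missing.

    Hypothetical helper: it only checks that data/lid.176.bin exists
    under crowl_dir and does not load or validate the model.
    """
    candidate = Path(crowl_dir) / "data" / MODEL_FILENAME
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    found = language_model_path()
    print(f"Model found at {found}" if found else "Model not found in ./data")
```

Run it from the crowl directory; if it reports that the model is missing, repeat the download step.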
Copy the config.sample.ini file to yourproject.ini and set your own settings.
The required settings are PROJECT_NAME and START_URL.
The list of all configuration options is available here.
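To catch a missing required setting before launching a crawl, you could run a quick check on your configuration file. The sketch below uses Python's standard configparser module; the [PROJECT] section name and the sample values are assumptions made for this illustration (check config.sample.ini for the real layout), and missing_required_settings is a hypothetical helper, not part of Crowl.

```python
import configparser

# The two settings the documentation lists as required.
REQUIRED_SETTINGS = ("PROJECT_NAME", "START_URL")


def missing_required_settings(ini_text):
    """Return the required settings that are absent or empty in ini_text.

    Hypothetical helper: it looks for the settings in every section,
    because the section layout of the real file is not assumed here.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(ini_text)
    found = set()
    for section in parser.sections():
        for name, value in parser.items(section):
            if value.strip():
                found.add(name)
    return [name for name in REQUIRED_SETTINGS if name not in found]


if __name__ == "__main__":
    # Illustrative content only; the section name is an assumption.
    example = "[PROJECT]\nPROJECT_NAME = yourproject\n"
    print(missing_required_settings(example))
```

To check a real file, pass its contents to the function, for example the text read from yourproject.ini.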
Simply launch Crowl from the command line:
python crowl.py --conf yourproject.ini
If you kept the default settings, data will be saved to CSV files.
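Once a crawl has produced CSV output, a few lines of Python are enough to take a first look at it. This is a minimal sketch using the standard csv module; the column names in the sample data are invented for the illustration, since the actual columns depend on your Crowl configuration, and summarize_csv is a hypothetical helper.

```python
import csv
import io


def summarize_csv(handle):
    """Return (column_names, row_count) for an open CSV file object."""
    reader = csv.DictReader(handle)
    row_count = sum(1 for _ in reader)
    return list(reader.fieldnames or []), row_count


if __name__ == "__main__":
    # Invented sample data standing in for a real output file.
    sample = io.StringIO(
        "url,status\n"
        "https://www.example.com/,200\n"
        "https://www.example.com/missing,404\n"
    )
    columns, rows = summarize_csv(sample)
    print(f"{rows} rows, columns: {', '.join(columns)}")
```

For a real file, open it with open(path, newline="", encoding="utf-8") and pass the resulting file object to summarize_csv.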
If you installed Crowl using git, simply download the latest version:
git pull origin master
If you didn’t use git, save your configuration files, delete all other files, and replace them with those from the new version.
You might also have to update the Python dependencies by running:
pip install -r requirements.txt
Remember to check out the release notes for the list of new features.