
ScrapingPros Client for Python


A client for interacting with the ScrapingPros API. The client provides an interface to manage projects, batches, and jobs for web scraping through the ScrapingPros service. The API offers various scraping modes, including HTTP requests, Selenium, and Puppeteer.

For more information, visit ScrapingPros Console.


Installation

To install this package, first install its requests dependency and then the scrapingpros package:

pip install requests
pip install scrapingpros

Classes


ScrapMode

Constants to define the method for scraping.

  • HTTP_REQUEST: Use HTTP requests for scraping.
  • SELENIUM: Use Selenium for browser automation.
  • PUPPETEER: Use Puppeteer for headless browsing.

ProjectManager

Class for managing project-related operations.

Methods

Instantiate a ProjectManager

ProjectManager(api_key: str, verbose: bool)

Creates a ProjectManager.

  • Params
    • api_key: API key obtained from the /login API endpoint, or from the ScrapingPros Home profile page after logging in.
    • verbose: Defaults to True.
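
A minimal instantiation sketch (the import path follows the Example section later in this document; the key value is a placeholder):

from scraping_pros import ProjectManager

# Use the API key from your ScrapingPros profile
manager = ProjectManager(api_key="your_api_key", verbose=True)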

Project creation

create_project(name: str, priority: int, description: str)

Creates a new project with a specified priority and optional description.

  • Params
    • name.
    • priority.
    • description: Defaults to None.
  • Returns
    • Project object.
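
For example (the project name and description are illustrative):

project = manager.create_project(
    "Price Monitor",                   # name
    priority=1,
    description="Daily price checks"   # optional, defaults to None
)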

Close project

close_project(project_id: int)

Closes a project and returns a confirmation.

  • Params
    • project_id.
  • Returns
    • Dict containing message of confirmation or failure.
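
A usage sketch (the project id is illustrative):

result = manager.close_project(project_id=123)
print(result)  # dict with a confirmation or failure message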

Batch

Represents a batch of jobs associated with a project.

Attributes

  • batch_id: Unique identifier for the batch.
  • jobs: List of jobs in the batch.

Methods

A batch must be created beforehand using create_batch from Project.

Get a Batch

Batch(api_key: str, batch_id: int, verbose: bool)

Gets a batch object.

  • Params
    • api_key: API key obtained from the /login API endpoint, or from the ScrapingPros Home profile page after logging in.
    • batch_id.
    • verbose: Defaults to True.
  • Returns
    • Batch object.
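
For example, attaching to an existing batch (the batch id is illustrative, and the import path is assumed to match the other examples):

from scraping_pros import Batch

batch = Batch(api_key="your_api_key", batch_id=1234, verbose=True)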

Scrap Modes information

get_scrap_modes_info()

Fetches and returns scrap mode information and job information (not HTML data).

  • Returns
        {
            "result": {
                "S": {  # Status (S for Success)
                    "HTTP_REQUEST": {
                        "count": 10,
                        "cost": 5.0
                    },
                    "SELENIUM": {
                        "count": 5,
                        "cost": 7.5
                    }
                },
                "P": {  # Status (P for Pending)
                    "HTTP_REQUEST": 3,
                    "SELENIUM": 2
                }
            }
        }
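
A sketch of reading the returned structure (key names follow the sample response above):

info = batch.get_scrap_modes_info()

# "S" groups finished jobs per scrap mode, "P" groups pending ones
for mode, stats in info["result"].get("S", {}).items():
    print(f"{mode}: {stats['count']} jobs, cost {stats['cost']}")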

Analyze Scrap Modes

analyze_scrap_modes()

Provides a summary of job success, costs, and pending jobs.

  • Returns
        {
            "successful_jobs": {
                "http_request": {
                    "count": 1,
                    "cost": 1
                }
            },
            "pending_jobs": {
                "http_request": 1
            },
            "total_cost": 1,
            "total_jobs": 2
        }
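
For example (field names follow the sample response above):

summary = batch.analyze_scrap_modes()
print(f"{summary['total_jobs']} jobs, total cost {summary['total_cost']}")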

Append jobs to batch

append_jobs(job_urls: list[str], arguments: dict[str, str], scrap_mode: ScrapMode, actions: list[Action], cookies: CookieJar)

Adds jobs to the batch using a specified scrape mode.

  • Params
    • job_urls: List of URLs to scrape.
    • arguments: Arguments for the jobs. Example: {"scraper":"detail_job"}.
    • scrap_mode: ScrapMode.
    • actions: A list of actions to perform on each page. The actions will be executed in the order they appear in the list. See Actions.
    • cookies: The cookies to use for the request. See Cookies.
  • Returns
    • Does not return a value.
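
A minimal call without actions or cookies (URLs and arguments are illustrative):

batch.append_jobs(
    ["https://example.com/item/1", "https://example.com/item/2"],
    {"scraper": "detail_job"},   # arguments forwarded to the jobs
    ScrapMode.HTTP_REQUEST,
    None,                        # no actions
    None                         # no cookies
)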

Run batch

run()

Initiates batch execution with retries if necessary.

Get data from batch

get_data(html_only: bool, save_path: str)

This method retrieves data from all the jobs in the batch. It does not wait for all jobs to finish.

The function provides two options:

  • Retrieve data as a compressed .zip file.
  • Retrieve only the HTML data.

Parameters:

  • html_only (bool):
    • If True, it retrieves only the HTML content from the jobs.
    • If False, it downloads a .zip file with the results.
    • Default value: False.
  • save_path (str, optional):
    • The path where the .zip file will be saved if html_only is False.
    • If not provided, the file will be saved in the current directory.
    • This parameter is ignored if html_only is True.

Returns:

  • A dictionary containing either the downloaded file or the JSON data, depending on the selected option.
    • If html_only is True, it returns a dictionary with the JSON data from the batch.
    • If html_only is False, it returns a dictionary with the path to the downloaded .zip file.

Exceptions:

  • Raises a ValueError if save_path is provided but is not a valid directory.

Example usage:

# Retrieve only HTML data
data = batch.get_data(html_only=True)

# Download the results as a ZIP file
data = batch.get_data(html_only=False, save_path='/path/to/folder')


Project

Represents a group of batches related to a project.

Attributes

  • project_id: Unique identifier for the project.

Methods

Get a Project

Project(api_key: str, project_id: int, verbose: bool)

Gets a Project object. Project must be created previously.

  • Params
    • api_key: API key obtained from the /login API endpoint, or from the ScrapingPros Home profile page after logging in.
    • project_id.
    • verbose: Defaults to True.
  • Returns
    • Project object.
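
For example, attaching to an existing project (the project id is illustrative, and the import path is assumed to match the other examples):

from scraping_pros import Project

project = Project(api_key="your_api_key", project_id=123, verbose=True)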

Create a batch for the project

create_batch(batch_name: str)

Creates a batch under the project.

  • Params
    • batch_name.
  • Returns
    • Batch object.

Get all batches from project

get_batches()
  • Returns
    • A list containing all batches from project.
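
A usage sketch:

batches = project.get_batches()
for batch in batches:
    print(batch.batch_id)  # batch_id attribute documented above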

Get projects summary

get_projects_summary(api_key: str, verbose: bool)

Retrieves a summary of all projects for the user.

  • Params
    • api_key: API key obtained from the /login API endpoint, or from the ScrapingPros Home profile page after logging in.
    • verbose: Defaults to True.
  • Returns
        {
            "summary": [
                {
                    "project_id": 123,
                    "name": "Project Name",
                    ...other project details...
                },
                ...
            ]
        }
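
A usage sketch, assuming the function can be called directly with the signature above:

summary = get_projects_summary(api_key="your_api_key", verbose=True)
for project_info in summary["summary"]:
    print(project_info["project_id"], project_info["name"])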

Example

from scraping_pros import ScrapMode, ProjectManager

# Initialize API client
api_key = "your_api_key"
manager = ProjectManager(api_key)

# Create a new project
project = manager.create_project("Test Project", priority=1, description="Example project")

# Create a batch under the project
batch = project.create_batch("Test Batch")

# Add jobs to the batch (without Actions and Cookies)
batch.append_jobs(["https://example.com/page1", "https://example.com/page2"], {"scraper": "list_jobs"}, ScrapMode.HTTP_REQUEST, None, None)

# Run the batch
batch.run()

# Retrieve data from completed jobs
data = batch.get_data()
print(data)

Actions

The Actions module provides different types of interactions for the job to execute. Each action represents a specific operation, such as clicking an element, scrolling, typing text, or waiting. These will be the actions to perform during the scraping process and will be passed as a list of Actions as a parameter to the append_jobs function. The actions will be executed in the order they appear in the list.

ActionMode

The available action modes are:

  • CLICK: Clicks on an element identified by an XPath.
  • SCROLLTOBOTTOM: Scrolls to the bottom of the page.
  • TYPEONINPUT: Types a given text into an input field identified by an XPath.
  • WAIT: Waits for a specified time (in milliseconds).

Usage

To create an action, use the Action.create method:

from actions import Action, ActionMode

# Click action
click_action = Action.create(ActionMode.CLICK, xpath="//button[@id='submit']", wait_navigation=True)

# Scroll to bottom
scroll_action = Action.create(ActionMode.SCROLLTOBOTTOM)

# Type in an input field
type_action = Action.create(ActionMode.TYPEONINPUT, xpath="//input[@name='email']", text="user@example.com")

# Wait for 2000 milliseconds
wait_action = Action.create(ActionMode.WAIT, time=2000)

How to send actions to the scraper
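
A sketch of wiring the actions built above into append_jobs (the URL is illustrative; see also the combined Actions and Cookies example at the end of this document):

# Actions run in list order on each page
actions = [click_action, scroll_action, type_action, wait_action]

batch.append_jobs(
    ["https://example.com/form"],
    {},                       # no extra arguments
    ScrapMode.PUPPETEER,      # a browser-based mode, since the actions interact with the page
    actions,
    None                      # no cookies
)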


Cookies

The Cookie class represents an individual cookie with a name and a value.

Attributes

  • name (str): The name of the cookie.
  • value (str): The value of the cookie.

CookieJar

The CookieJar class stores and manages a collection of cookies. A variable of this class is passed as a parameter to the append_jobs function to scrape with the desired cookies.

Methods

  • addCookie(cookie: Cookie): Adds a cookie to the container.
  • addCookies(cookies: list[Cookie]): Adds multiple cookies to the container.
  • showCookies(): Displays all the stored cookies.
  • cleanCookies(): Removes all cookies from the container.
  • getCookies() -> list[dict]: Returns a list of dictionaries with the stored cookies.

Example

cookie1 = Cookie("session", "abc123")
cookie2 = Cookie("user", "username")
cookies = CookieJar()
cookies.addCookies([cookie1, cookie2])

Example usage Actions and Cookies

batch = Batch(api_key=API_KEY, batch_id=1234, verbose=True)
scroll = Action.create(ActionMode.SCROLLTOBOTTOM)
click = Action.create(ActionMode.CLICK, xpath="//*[@id='btn']", wait_navigation=False)
actions = [scroll, click]

# Using the CookieJar created in the Cookies example above
batch.append_jobs(["https://example.com/page3"], {}, ScrapMode.PUPPETEER, actions, cookies)