# ScrapingPros Client for Python
A client for interacting with the ScrapingPros API. The client provides an interface to manage projects, batches, and jobs for web scraping through the ScrapingPros service. The API offers various scraping modes, including HTTP requests, Selenium, and Puppeteer.
For more information, visit ScrapingPros Console.
## Installation

To install the dependencies for this package, install `requests` and then the `scrapingpros` package:

```bash
pip install requests
pip install scrapingpros
```
## Classes
## ScrapMode
Constants to define the method for scraping.
- HTTP_REQUEST: Use HTTP requests for scraping.
- SELENIUM: Use Selenium for browser automation.
- PUPPETEER: Use Puppeteer for headless browsing.
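For example, a minimal sketch of choosing a mode (the import path follows the usage example later in this README):

```python
from scraping_pros import ScrapMode

# HTTP requests are the lightest-weight mode; switch to SELENIUM or
# PUPPETEER when the target page needs a real browser to render.
mode = ScrapMode.HTTP_REQUEST
```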
***

## ProjectManager
Class for managing project-related operations.
### Methods
#### Instantiate a ProjectManager

```python
ProjectManager(api_key: str, verbose: bool)
```
Creates a ProjectManager.
- **Params**
  - `api_key`: API key, obtained from the `/login` API endpoint or from the ScrapingPros profile page after logging in.
  - `verbose`: Defaults to `True`.
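For example, a minimal sketch of instantiating a manager (the API key string is a placeholder):

```python
from scraping_pros import ProjectManager

# verbose defaults to True; pass False to silence the client's logging
manager = ProjectManager("your_api_key", verbose=False)
```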
#### Project creation

```python
create_project(name: str, priority: int, description: str)
```

Creates a new project with a specified priority and an optional description.

- **Params**
  - `name`.
  - `priority`.
  - `description`: Defaults to `None`.
- **Returns**
  - A `Project` object.
#### Close project

```python
close_project(project_id: int)
```

Closes a project and returns a confirmation.

- **Params**
  - `project_id`.
- **Returns**
  - A dict containing a confirmation or failure message.
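For example, a minimal sketch of closing a project created earlier, assuming the `manager` and `project` objects from the snippets above and the `project_id` attribute documented under Project:

```python
# Close the project once all of its batches have run
result = manager.close_project(project.project_id)
print(result)  # dict with a confirmation or failure message
```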
***

## Batch
Represents a batch of jobs associated with a project.
### Attributes

* ***batch_id***: Unique identifier for the batch.
* ***jobs***: List of jobs in the batch.
### Methods

A batch must be created beforehand using `create_batch` on a `Project`.
#### Get a Batch

```python
Batch(api_key: str, batch_id: int, verbose: bool)
```

Gets a batch object.

- **Params**
  - `api_key`: API key, obtained from the `/login` API endpoint or from the ScrapingPros profile page after logging in.
  - `batch_id`.
  - `verbose`: Defaults to `True`.
- **Returns**
  - A `Batch` object.
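For example, a minimal sketch of re-attaching to an existing batch (this assumes `Batch` is importable from the package like the other classes; `1234` is a placeholder ID):

```python
from scraping_pros import Batch

batch = Batch(api_key="your_api_key", batch_id=1234, verbose=True)
```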
#### Scrap Modes information

```python
get_scrap_modes_info()
```

Fetches and returns scrap mode and job information (not HTML data).

- **Returns**

```python
{
    "result": {
        "S": {  # Status (S for Success)
            "HTTP_REQUEST": {
                "count": 10,
                "cost": 5.0
            },
            "SELENIUM": {
                "count": 5,
                "cost": 7.5
            }
        },
        "P": {  # Status (P for Pending)
            "HTTP_REQUEST": 3,
            "SELENIUM": 2
        }
    }
}
```
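For example, a sketch that reads the per-mode counts and costs of successful jobs from this structure (the keys follow the sample above):

```python
info = batch.get_scrap_modes_info()

# "S" groups successful jobs by scrap mode
for mode, stats in info["result"].get("S", {}).items():
    print(f"{mode}: {stats['count']} jobs, cost {stats['cost']}")
```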
#### Analyze Scrap Modes

```python
analyze_scrap_modes()
```

Provides a summary of job success, costs, and pending jobs.

- **Returns**

```python
{
    "successful_jobs": {
        "http_request": {
            "count": 1,
            "cost": 1
        }
    },
    "pending_jobs": {
        "http_request": 1
    },
    "total_cost": 1,
    "total_jobs": 2
}
```
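For example, a sketch that reports overall progress from this summary (the field names follow the sample above):

```python
summary = batch.analyze_scrap_modes()

pending = sum(summary["pending_jobs"].values())
print(f"{summary['total_jobs']} jobs, {pending} pending, total cost: {summary['total_cost']}")
```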
#### Append jobs to batch

```python
append_jobs(job_urls: list[str], arguments: dict[str, str], scrap_mode: ScrapMode, actions: list[Action], cookies: CookieJar)
```

Adds jobs to the batch using a specified scrap mode.

- **Params**
  - `job_urls`: List of URLs to scrape.
  - `arguments`: Arguments for the jobs. Example: `{"scraper": "detail_job"}`.
  - `scrap_mode`: The `ScrapMode` to use.
  - `actions`: A list of actions to perform on each page, executed in the order they appear in the list. See [Actions](#actions).
  - `cookies`: The cookies to use for the request. See [Cookies](#cookies).
- **Returns**
  - Does not return a value.
#### Run batch

```python
run()
```

Initiates batch execution with retries if necessary.
#### Get data from batch

```python
get_data(html_only: bool, save_path: str)
```

Retrieves data from all the jobs in the batch. It does not wait for all jobs to finish.

The function provides two options:

- Retrieve the results as a compressed `.zip` file.
- Retrieve only the HTML data.

- **Params**
  - `html_only` (bool): If `True`, retrieves only the HTML content from the jobs. If `False`, downloads a `.zip` file with the results. Defaults to `False`.
  - `save_path` (str, optional): The path where the `.zip` file will be saved when `html_only` is `False`. If not provided, the file is saved in the current directory. This parameter is ignored when `html_only` is `True`.
- **Returns**
  - If `html_only` is `True`, a dictionary with the JSON data from the batch.
  - If `html_only` is `False`, a dictionary with the path to the downloaded `.zip` file.
- **Exceptions**
  - Raises a `ValueError` if `save_path` is provided but is not a valid directory.

Example usage:

```python
# Retrieve only the HTML data
data = batch.get_data(html_only=True)

# Download the results as a ZIP file
data = batch.get_data(html_only=False, save_path='/path/to/folder')
```
***
## Project
Represents a group of batches related to a project.
### Attributes
* ***project_id***: Unique identifier for the project.
### Methods
#### Get a Project
```python
Project(api_key: str, project_id: int, verbose: bool)
```

Gets a `Project` object. The project must be created beforehand.

- **Params**
  - `api_key`: API key, obtained from the `/login` API endpoint or from the ScrapingPros profile page after logging in.
  - `project_id`.
  - `verbose`: Defaults to `True`.
- **Returns**
  - A `Project` object.
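For example, a minimal sketch of re-attaching to an existing project (this assumes `Project` is importable from the package; `123` is a placeholder ID):

```python
from scraping_pros import Project

project = Project(api_key="your_api_key", project_id=123, verbose=True)
```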
#### Create a batch for the project

```python
create_batch(batch_name: str)
```

Creates a batch under the project.

- **Params**
  - `batch_name`.
- **Returns**
  - A `Batch` object.
#### Get all batches from project

```python
get_batches()
```

- **Returns**
  - A list containing all batches from the project.
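For example, a sketch that totals costs across a project by combining this with `analyze_scrap_modes` (this assumes each returned item is a `Batch` object, per the list above):

```python
batches = project.get_batches()

# Sum the reported cost of every batch in the project
total_cost = sum(b.analyze_scrap_modes()["total_cost"] for b in batches)
print(f"Project cost so far: {total_cost}")
```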
#### Get projects summary

```python
get_projects_summary(api_key: str, verbose: bool)
```

Retrieves a summary of all projects for the user.

- **Params**
  - `api_key`: API key, obtained from the `/login` API endpoint or from the ScrapingPros profile page after logging in.
  - `verbose`: Defaults to `True`.
- **Returns**

```python
{
    "summary": [
        {
            "project_id": 123,
            "name": "Project Name",
            ...other project details...
        },
        ...
    ]
}
```
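Since this method takes the API key directly, here is a sketch assuming it can be called without an existing `Project` instance (the module-level import is an assumption; it may instead be exposed as a static method on `Project`):

```python
from scraping_pros import get_projects_summary

summary = get_projects_summary("your_api_key", verbose=True)
for proj in summary["summary"]:
    print(proj["project_id"], proj["name"])
```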
***

## Example

```python
from scraping_pros import ScrapMode, ProjectManager

# Initialize API client
api_key = "your_api_key"
manager = ProjectManager(api_key)

# Create a new project
project = manager.create_project("Test Project", priority=1, description="Example project")

# Create a batch under the project
batch = project.create_batch("Test Batch")

# Add jobs to the batch (without Actions and Cookies)
batch.append_jobs(["https://example.com/page1", "https://example.com/page2"], {"scraper": "list_jobs"}, ScrapMode.HTTP_REQUEST, None, None)

# Run the batch
batch.run()

# Retrieve data from completed jobs
data = batch.get_data()
print(data)
```
***

## Actions

The `Actions` module provides different types of interactions for the job to execute. Each action represents a specific operation, such as clicking an element, scrolling, typing text, or waiting.

These actions are performed during the scraping process and are passed as a list of `Action` objects to the `append_jobs` function. The actions are executed in the order they appear in the list.
### ActionMode

The available action modes are:

- `CLICK`: Clicks on an element identified by an XPath.
- `SCROLLTOBOTTOM`: Scrolls to the bottom of the page.
- `TYPEONINPUT`: Types a given text into an input field identified by an XPath.
- `WAIT`: Waits for a specified time (in milliseconds).
### Usage

To create an action, use the `Action.create` method:

```python
from actions import Action, ActionMode

# Click action
click_action = Action.create(ActionMode.CLICK, xpath="//button[@id='submit']", wait_navigation=True)

# Scroll to bottom
scroll_action = Action.create(ActionMode.SCROLLTOBOTTOM)

# Type in an input field
type_action = Action.create(ActionMode.TYPEONINPUT, xpath="//input[@name='email']", text="user@example.com")

# Wait for 2000 milliseconds
wait_action = Action.create(ActionMode.WAIT, time=2000)
```
### How to send actions to the scraper

See [Example usage with Actions and Cookies](#example-usage-with-actions-and-cookies) below.
***

## Cookies

The `Cookie` class represents an individual cookie with a name and a value.

### Attributes

* ***name*** (str): The name of the cookie.
* ***value*** (str): The value of the cookie.
### CookieJar

The `CookieJar` class stores and manages a collection of cookies. An instance of this class is passed as a parameter to the `append_jobs` function to scrape with the desired cookies.

#### Methods

- `addCookie(cookie: Cookie)`: Adds a cookie to the container.
- `addCookies(cookies: list[Cookie])`: Adds multiple cookies to the container.
- `showCookies()`: Displays all the stored cookies.
- `cleanCookies()`: Removes all cookies from the container.
- `getCookies() -> list[dict]`: Returns a list of dictionaries with the stored cookies.
#### Example

```python
cookie1 = Cookie("session", "abc123")
cookie2 = Cookie("user", "username")

cookies = CookieJar()
cookies.addCookies([cookie1, cookie2])
```
***

## Example usage with Actions and Cookies

```python
batch = Batch(api_key=API_KEY, batch_id=1234, verbose=True)

scroll = Action.create(ActionMode.SCROLLTOBOTTOM)
click = Action.create(ActionMode.CLICK, xpath="//*[@id='btn']", wait_navigation=False)
actions = [scroll, click]

# Using the CookieJar created in the previous example
batch.append_jobs(["https://example.com/page3"], {}, ScrapMode.PUPPETEER, actions, cookies)
```