Deterministic Parallel Download of Salesforce ContentVersion Files Using Robot Framework and Pabot

Hello everyone,

I wanted to share a framework I recently built using Robot Framework, SeleniumLibrary, and Pabot, and to ask for feedback from others who have worked with Salesforce at scale.

The use case involves downloading large volumes of Salesforce ContentVersion files. In this context, daily API limits, session-scoped authentication, and UI-mediated file delivery through the Shepherd endpoint introduce constraints that make naive parallel execution unreliable.

Context

When attempting high-volume file extraction, common challenges include:

  • API quota amplification caused by retries

  • Instability across concurrent browser sessions

  • Non-deterministic workload distribution

  • Flakiness due to UI-mediated delivery mechanisms

Robot Framework and Pabot provide strong orchestration capabilities, but ensuring bounded concurrency and predictable scaling behavior remains an architectural responsibility.

Approach

The framework applies:

  • Deterministic ContentDocumentId partitioning across workers

  • Separation of REST-based metadata retrieval from Selenium-driven, UI-based binary downloads

  • Bounded process-level parallelism using Pabot

  • Single-attempt execution to reduce quota amplification

  • Structured validation and logging for observability

The primary goal was not maximum throughput, but predictable and reproducible behavior under Salesforce platform constraints.
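To make the first bullet concrete, deterministic partitioning can be done with a stable hash so every run assigns the same ContentDocumentIds to the same worker. A minimal Python sketch (function and variable names are illustrative, not taken from the repository):

```python
import hashlib

def partition_ids(content_document_ids, worker_count):
    """Assign each ContentDocumentId to exactly one worker via a stable
    hash, so repeated runs over the same input produce the same split."""
    partitions = [[] for _ in range(worker_count)]
    for doc_id in content_document_ids:
        # md5 is stable across processes and Python versions, unlike hash()
        bucket = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
        partitions[bucket % worker_count].append(doc_id)
    return partitions

# Example: four ids split deterministically across three workers
ids = ["069A0000001", "069A0000002", "069A0000003", "069A0000004"]
split = partition_ids(ids, 3)
```

Because the assignment depends only on the id and the worker count, workers need no shared state at run time, which is what makes the runs reproducible.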

I would appreciate feedback from the community on:

  • Patterns others use for quota safe parallel execution

  • Techniques to reduce flakiness under UI mediated delivery

  • Observability practices for large parallel suites

GitHub Repository:

Thank you to the maintainers and community for the continued work on Robot Framework and Pabot.



Hi Bhimeswara,

Let's see if I understood the problem:

  • You first log in to the Salesforce web UI with SeleniumLibrary and gather a list of the Salesforce ContentVersion files that you need to download
  • Then you iterate over that list of ContentVersion files and download them via the Salesforce API
  • You want to leverage pabot to download the ContentVersion files in parallel
  • You have an API quota limit you can’t exceed

If I’ve understood correctly, then here’s the approach I would take:

Break the problem up into smaller pieces

  • 1 test case for the Web UI part: log in, gather the list you need, and store it somewhere you can use as a queue. I would use TestDataTable for that (a tool I created), but you could also use an MQ server (like RabbitMQ) or a database server, your choice; the important part here is not to use a file (text file or Excel file), because when you run with pabot you'll encounter locking issues
    • if you want to get fancy you could put a check at the beginning of this test case, and exit early if the queue already has data in it
  • next you'll want to create the test case that downloads 1 ContentVersion file via the API, but rather than creating it as a test case, create it as a keyword that you can then use as a test template; this test case would
    • read a ContentVersion file id number / url from the queue we created above; if you use TestDataTable it will be removed from the queue automatically for you, otherwise you may need to mark it as consumed / used / in use when you read it
      • if the queue is empty, exit the test as passed here
    • perform the API calls that download the ContentVersion file
    • save the ContentVersion file somewhere
  • next you'll probably want to use the templates with loops feature of Robot Framework to run your test template n times (e.g. FOR ${index} IN RANGE 1000), where that limit is however many API calls you can make in a day without exceeding your quota (this might need some trial runs: run 1000, see how much of the quota you used, and estimate the limit from that)

I would suggest putting the Web UI test in one file and the API tests and test template in another, to make them easier for you to manage
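The template loop described above amounts to a bounded queue drain. Here is the same control flow as a plain-Python sketch (the queue object, the limit, and the `download_one` callback are placeholders for the real queue and keyword; in Robot Framework this would be a test template inside FOR ... IN RANGE):

```python
from collections import deque

def drain_queue(queue, daily_limit, download_one):
    """Consume at most `daily_limit` items from the work queue, stopping
    early when the queue is empty. `download_one` stands in for the
    keyword that downloads a single ContentVersion file."""
    done = 0
    for _ in range(daily_limit):
        if not queue:             # queue empty: exit as passed
            break
        doc_id = queue.popleft()  # popping marks the item as consumed
        download_one(doc_id)
        done += 1
    return done

# Example with a stub downloader and a three-item queue
work = deque(["069A", "069B", "069C"])
count = drain_queue(work, daily_limit=1000, download_one=lambda doc_id: None)
```

The daily limit caps API consumption regardless of queue depth, and the early exit is what lets you simply rerun the same suite each day until the queue is drained.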

  • Day 1
    • run the WebUI test first, just with robot, as it's a single test
    • run the API tests with pabot
  • Day 2
    • run only the API tests with pabot

Repeat Day 2 until the queue is empty.

Hope that helps,

Dave.


Hi Dave,

Thank you for the detailed response, I appreciate the thoughtful breakdown.

Just to clarify the architecture: authentication is handled via Salesforce CLI, which provides the access token and instance context used across both the REST and browser layers.

Selenium is used for headless browser-based binary downloads because Salesforce delivers the actual file payload through the Shepherd UI endpoint. REST (SOQL against ContentVersion and ContentDocumentLink) is used for metadata retrieval and for generating Data Loader–ready mapping files, not for the binary transfer itself.

The separation is intentional:

  • Input: ContentDocumentIds are provided through structured Excel files that define the workload

  • Authentication layer: Salesforce CLI for session and token management

  • REST plane: Metadata extraction and Data Loader file generation

  • Binary plane: Selenium-driven headless browser download of file content
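For the REST plane, the metadata lookup is typically a SOQL query against ContentVersion filtered to the worker's ContentDocumentIds. A minimal query builder might look like this (the helper name is illustrative; the field list assumes standard ContentVersion fields):

```python
def build_metadata_query(content_document_ids):
    """Build a SOQL query for the latest ContentVersion metadata of the
    given ContentDocumentIds; the binary payload is fetched separately."""
    id_list = ", ".join(f"'{doc_id}'" for doc_id in content_document_ids)
    return (
        "SELECT Id, Title, FileExtension, ContentDocumentId "
        "FROM ContentVersion "
        f"WHERE IsLatest = true AND ContentDocumentId IN ({id_list})"
    )

query = build_metadata_query(["069A0000001", "069A0000002"])
```

Filtering on IsLatest keeps one version per document, which matches the one-file-per-download model on the binary plane.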

The core challenge I’ve been focusing on is safely parallelizing the Selenium-driven download layer using Pabot while enforcing bounded concurrency to avoid session instability and quota amplification.

Your suggestion around queue-based work distribution is interesting. In the current implementation I use deterministic partitioning of ContentDocumentIds to keep runs reproducible and avoid shared state, but I can see how a queue could help with dynamic balancing.
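If static partitioning ever gives way to a shared queue, an embedded database can provide atomic claims without standing up an MQ server. A minimal SQLite sketch, with illustrative table and column names (not from the current implementation):

```python
import sqlite3

def claim_next(conn):
    """Atomically claim one pending ContentDocumentId, or return None.
    BEGIN IMMEDIATE takes the write lock up front, so two workers sharing
    the database file cannot claim the same row."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT doc_id FROM work_queue WHERE claimed = 0 "
            "ORDER BY doc_id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE work_queue SET claimed = 1 WHERE doc_id = ?", (row[0],)
            )
        conn.execute("COMMIT")
        return row[0] if row else None
    except Exception:
        conn.execute("ROLLBACK")
        raise

# Example against an in-memory database (a real setup would share a file)
conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
conn.execute("CREATE TABLE work_queue (doc_id TEXT PRIMARY KEY, claimed INTEGER)")
conn.executemany("INSERT INTO work_queue VALUES (?, 0)", [("069A",), ("069B",)])
first, second, third = claim_next(conn), claim_next(conn), claim_next(conn)
```

The trade-off against hash partitioning is dynamic load balancing at the cost of shared state; the transaction keeps that shared state safe across Pabot's separate worker processes.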

If you’ve seen effective patterns for improving stability in headless browser file downloads at scale under parallel execution, I would really value your perspective.

I’ve attached a high-level architecture diagram below to illustrate the separation more clearly.

Thanks again for engaging with this.

Hi Bhimeswara,

Hopefully someone can give better tips than me on improving stability in headless browser file downloads, as I don’t use SeleniumLibrary very often. (As I mostly do performance testing)

I will mention that Pabot creates a separate Robot Framework process for each robot it runs (this is also what performance testing tools do). This means the browser running in each robot is completely isolated from the browsers run by the other robots, which should help with stability.

Also, if you keep the tests short and only download 1 file per test, this will help with stability, as each test starts "fresh". Browsers consume a lot of memory and get more unstable the longer you use them. In normal usage this is not a big problem, as people tend to stay on the same page for quite a while and don't do many page loads / downloads per hour; this is why you might find your browser needs a restart after you've been using it constantly for a week without closing it. With test automation this gets exacerbated, because the automation drives the browser much faster than a human would.

Looking at your high-level architecture diagram, I noticed that you gather the access token before running robot. It might be a good idea to move this step into a Suite Setup:

  • firstly, because this will make it easier to use the access token in the robot script
  • but also because each robot might then get a separate access token, which might help with stability (on the server side) or, depending on how your API quotas are applied, might give you more downloads per day
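Whether the token is fetched before the run or in a Suite Setup, the CLI side is the same. A small Python sketch, assuming the JSON shape of `sf org display --json` (a real Salesforce CLI command, but verify the exact fields against your CLI version; function names here are illustrative):

```python
import json
import subprocess

def parse_org_auth(raw_json):
    """Extract (access_token, instance_url) from `sf org display --json`
    output, assumed to look like:
    {"status": 0, "result": {"accessToken": "...", "instanceUrl": "..."}}"""
    result = json.loads(raw_json)["result"]
    return result["accessToken"], result["instanceUrl"]

def fetch_org_auth(org_alias):
    """Run the Salesforce CLI for the given org alias; kept separate from
    parsing so the parsing can be tested offline."""
    proc = subprocess.run(
        ["sf", "org", "display", "--target-org", org_alias, "--json"],
        capture_output=True, text=True, check=True,
    )
    return parse_org_auth(proc.stdout)

# Offline example with a canned payload
sample = ('{"status": 0, "result": {"accessToken": "FAKE_TOKEN", '
          '"instanceUrl": "https://example.my.salesforce.com"}}')
token, url = parse_org_auth(sample)
```

Called from a Suite Setup keyword, this would give each Pabot worker its own token as suggested above.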

Dave.

Hi Dave,

Thank you, that’s helpful context.

I completely agree on the isolation point. One of the reasons I chose Pabot was precisely because each worker runs in a separate process, which gives clean browser isolation and reduces cross-test interference. In the current implementation, each worker handles a single file per test case for the same reason you mentioned, to keep browser sessions short-lived and reduce memory buildup.

Regarding authentication, that's a really good point. At the moment, the Salesforce CLI step runs before invoking Robot, primarily to establish a consistent authenticated context and to pass the access token and instance URL into the execution environment. This keeps the control plane deterministic across workers.

Moving it into Suite Setup is interesting, especially if each Pabot process would obtain its own access token. That could potentially distribute server-side load differently depending on how Salesforce applies quotas and session limits. I will experiment with that and observe whether separate tokens per worker have any measurable impact on stability or quota behavior.

Thanks again for the thoughtful suggestions.

B Vamsi Punnam.
