How to Scrape PACER using Juriscraper
According to Congressional testimony, data aggregation accounts for the majority of PACER usage. That means that PACER is, in fact, functioning as an API; it's just a very bad one. Instead of serving structured data like a good API, it serves HTML, has no specification or documentation, and returns a different data schema depending on the item returned. No big deal.
Luckily, you are not the first to tread this path, and this page describes how to scrape PACER using Free Law Project's scraping library, Juriscraper.
Why Use Juriscraper
We highly recommend using Juriscraper rather than trying to scrape and parse PACER yourself. It has hundreds of test cases and has been battle-tested in CourtListener for years. Because it is built into CourtListener, it is also maintained, and a small community shares the load of finding and fixing problems. Juriscraper has scraped tens of millions of pages from PACER, and is ready for many of the use cases you can dream up.
But there are things Juriscraper cannot do:
- Many of the reports on PACER haven't been coded up yet.
- Throttling Juriscraper is left to the code that calls it.
- The PACER case locator (PCL) isn't supported yet.
- You can't get your usage or billing information.
And so forth. If you need these or other as-yet-unsupported features, you may want to go your own way. Or you can add them to Juriscraper so the community can benefit. Your call.
What Juriscraper Can Do
Juriscraper has downloaders and parsers for many reports on PACER at both the appellate and district court levels. This includes:
- docket pages
- attachment pages
- the history report
- the free opinion report
- the case query page
And so forth. It also has code for logging into PACER and maintaining your session.
Juriscraper is woefully underdocumented, but you can browse the various reports in its repo.
For each page that it handles, you'll usually find a Python object as your entry point and two important functions: one for downloading something and the other for parsing it. For example, the juriscraper.pacer.DocketReport object has a lot of code, but the query and parse methods get you almost everything you need.
The session management code is handled by juriscraper.pacer.PacerSession.
Therefore, to download a docket from PACER, you can do something like this:
from juriscraper.pacer import DocketReport, PacerSession
# Create a session and log in
s = PacerSession(username='foo', password='bar')
s.login()
# Create a docket report object for the court with the session
r = DocketReport('dcd', s)
# Query a case by its internal ID (this incurs a charge)
r.query(178502)  # the argument is the court's internal pacer_case_id
# Get the data
print(r.data)
The other PACER reports in Juriscraper follow this pattern:
- Log into PACER
- Make a report object
- Query something
- Grab the parsed data as a Python dict (easily serialized to JSON)
If you look in the code for DocketReport, you'll see that there are a variety of other methods and properties too, like parties, metadata, etc., that can help you zero in on what you need.
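For a rough sense of the shape of the parsed data, here is a trimmed sketch. The key names follow Juriscraper's docket parsing, but the values are invented for illustration:

```python
# Illustrative only: a trimmed, invented example of the dict that
# DocketReport.data returns. Key names follow Juriscraper's docket
# parsing; the values here are made up.
sample_data = {
    "court_id": "dcd",
    "docket_number": "1:16-cv-00745",
    "case_name": "Example Plaintiff v. Example Defendant",
    "parties": [
        {"name": "EXAMPLE PLAINTIFF", "type": "Plaintiff"},
    ],
    "docket_entries": [
        {"document_number": "1", "description": "COMPLAINT"},
    ],
}

def summarize(data):
    """Pull out a few high-level fields from a parsed docket."""
    return {
        "case": data["case_name"],
        "party_count": len(data["parties"]),
        "entry_count": len(data["docket_entries"]),
    }

print(summarize(sample_data))
```

Once you have the dict, it's plain Python from there: iterate the docket entries, filter the parties, or serialize the whole thing to JSON.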
If you need example usages, check the tests in Juriscraper or look in CourtListener's code, where there are numerous examples.
Other Options
If you don't want to do this yourself, you can try our RECAP Fetch API, which gives you a clean API for scraping PACER. You give it your PACER credentials and it does the work.
Alternatively, if you want to do this by hand, you could also use our RECAP Extension. Simply install it, buy the things you need by browsing PACER, and they will be added to our database and APIs.
Tips and Tricks for Scraping PACER
How Do the Courts Monitor?
In our experience, if your traffic blends in with the existing traffic and doesn't interfere with their uptime, you'll be good to go. It doesn't seem like they do a ton of monitoring. Remember that orgs like Bloomberg and Free Law Project do incredible amounts of scraping.
So long as your scraper is just more background noise, you're probably fine.
How Fast Can I Go?
Each court runs on independent servers, so that's your baseline. If you want to gather a lot of data, we recommend a round-robin approach across the PACER servers. This lets you fetch around 200 items per loop (one per court), and even with a short delay between requests it allows aggressive data aggregation without hitting any individual court's servers in parallel.
As for how fast to go, that depends what you're doing and you'll need to develop an intuition. The courts ask that you crawl at night and on weekends, which may be a reasonable request, depending on your use case.
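The round-robin approach can be sketched as a small scheduler that hands you at most one item per court per pass. The court IDs and work queues below are illustrative; in practice you'd use the full set of PACER court IDs:

```python
import itertools

# Hypothetical subset of courts; in practice, use the full list of
# PACER court IDs you care about.
COURT_IDS = ["dcd", "nysd", "cand", "txed"]

def round_robin(work_by_court):
    """Yield (court, item) pairs one court at a time, so no single
    court's server sees back-to-back requests from us."""
    queues = {c: list(items) for c, items in work_by_court.items()}
    for court in itertools.cycle(COURT_IDS):
        if not any(queues.values()):
            break  # all queues drained
        if queues.get(court):
            yield court, queues[court].pop(0)

work = {"dcd": [1, 2], "nysd": [3], "cand": [4], "txed": []}
schedule = list(round_robin(work))
print(schedule)
```

In a real crawler you'd sleep briefly between yields; the point is only that consecutive requests land on different courts' servers.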
Some things to consider:
Don't bring down the servers.
Remember that these servers are old and feeble and fairly easy to overload. If you take out a server, you're doing harm to justice, so don't do that. The way to avoid this is to slowly scale up your work while carefully monitoring HTTP response times. If your response times are going up, you may be the reason. Be careful.
Small jurisdictions have small servers.
This may be old information, but our experience is that smaller jurisdictions scale worse than big ones. This may be because the big ones have staff and can better absorb your crawler into their larger day-to-day traffic load. Be careful when crawling less-populous jurisdictions.
You may get blocked.
If you crawl too fast, you may get blocked by the court. The blocking system will usually unblock you automatically after some amount of time, but if not, a phone call to the court usually works. Failing that, you need a new IP address.
We haven't seen evidence that the IT folks can easily connect an IP address to an account and block the account. It's probably hard for them to do that, but maybe we've been lucky.
Some things are hard for them and easy for you.
Crawling a really big docket is just as easy for you as crawling a small one, but it makes a bigger impact on the servers running PACER. This means, for example, that if you're crawling huge class action cases, you should go slower than if you're crawling smaller, shorter cases.
This also seems to apply to larger documents (measured in bytes, not pages), so if you're crawling scanned documents, you need to go slower than born-digital ones (if you can detect that).
Ultimately, pay attention to PACER and develop an intuition for what takes time for it to do. Then, when you crawl, adjust your system accordingly.
Pay attention to your RTTs
The gold standard for crawling is to monitor the round trip times (RTTs) of your crawl and to slow down when your RTTs get longer or go faster when they are shorter.
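A minimal sketch of that idea, with illustrative (not tuned) thresholds, might look like this:

```python
from collections import deque

class RttThrottle:
    """Back off when the server's round-trip times creep up.
    The thresholds and multipliers here are illustrative guesses,
    not tuned values."""

    def __init__(self, base_delay=2.0, window=10):
        self.base_delay = base_delay
        self.rtts = deque(maxlen=window)
        self.baseline = None

    def record(self, rtt):
        """Record one request's RTT in seconds; the first full
        window establishes the baseline."""
        self.rtts.append(rtt)
        if self.baseline is None and len(self.rtts) == self.rtts.maxlen:
            self.baseline = sum(self.rtts) / len(self.rtts)

    def delay(self):
        """Seconds to sleep before the next request."""
        if not self.baseline or not self.rtts:
            return self.base_delay
        recent = sum(self.rtts) / len(self.rtts)
        ratio = recent / self.baseline
        if ratio > 1.5:   # RTTs up 50%+: we may be the problem
            return self.base_delay * 4
        if ratio > 1.2:   # mildly elevated: ease off
            return self.base_delay * 2
        return self.base_delay

t = RttThrottle(base_delay=2.0, window=3)
for rtt in (0.4, 0.4, 0.4):   # establish a baseline
    t.record(rtt)
print(t.delay())               # baseline RTTs: base delay
for rtt in (0.9, 0.9, 0.9):   # the server is slowing down
    t.record(rtt)
print(t.delay())               # elevated RTTs: much longer delay
```

You'd call record() with each response's elapsed time and sleep for delay() before the next request. The same structure works per court, since each court's servers behave differently.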
IP Address Pinning
If you are using multiple IP addresses to crawl PACER, you'll need to log in and maintain sessions per IP address. Cookies from PACER are associated with the IP address that created them, so you have to use each session from the same address. It's possible to have many simultaneous sessions in PACER, but you have to crawl from the same IP address you logged in from.
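One way to honor this constraint is to keep a pool of sessions keyed by IP address, creating each one lazily. This is a sketch: the actual IP binding (a proxy, a source-address-aware transport, etc.) is left to your setup, and the factory below is a hypothetical stand-in:

```python
class SessionPool:
    """Keep one logged-in session per originating IP address, since
    PACER cookies are tied to the IP that created them."""

    def __init__(self, session_factory):
        # session_factory(ip) must return a session whose requests
        # leave from `ip`; how you bind the IP is up to your setup.
        self._factory = session_factory
        self._sessions = {}

    def for_ip(self, ip):
        """Return the session for this IP, logging in on first use."""
        if ip not in self._sessions:
            self._sessions[ip] = self._factory(ip)
        return self._sessions[ip]

# With Juriscraper, the factory might look like this (untested sketch):
#
#   from juriscraper.pacer import PacerSession
#
#   def make_session(ip):
#       s = PacerSession(username="foo", password="bar")
#       # Bind outgoing traffic to `ip` here, e.g. via a proxy.
#       s.login()
#       return s
#
#   pool = SessionPool(make_session)
#   session = pool.for_ip("203.0.113.7")

# Demonstration with a dummy factory:
pool = SessionPool(lambda ip: object())
a = pool.for_ip("203.0.113.7")
b = pool.for_ip("203.0.113.7")
print(a is b)   # same IP reuses the same session
```

The key property is that every request from a given IP reuses the session that was logged in from that IP, never a session created elsewhere.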