How does the data in Harvard's Caselaw Access Project compare to CourtListener's case law database?
We have worked closely with the folks at the Harvard Library Innovation Lab and benefitted tremendously from their work and support. They did groundbreaking work digitizing their incredible library of case law and they generously supported work to incorporate it into CourtListener.
We began working with their data several years before its full public release. Along the way, we used human editors and machine learning to make over a million enhancements to the data. These take three main forms:
Correctness
Any dataset of this size is going to have errors in its structure or formatting. We used a machine learning model to systematically correct these errors so that the data could be properly ingested into CourtListener.
Normalization
A number of fields in the Harvard dataset were provided as text, which we normalized into specific fields. For example, in the Harvard data, the court field is plain text. In CourtListener, we've normalized this into our database of verified courts. This is a major undertaking across centuries of data, typos, and errors, but it makes searching the data more reliable, and it eliminates errors that were in the source books.
We completed this type of work across a number of fields.
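As a rough illustration of what normalizing a plain-text court name can involve, here is a minimal Python sketch. It is not CourtListener's actual pipeline: the lookup table, the court IDs, the function names, and the 0.9 similarity cutoff are all assumptions made up for the example. It cleans the raw string, tries an exact match against a small table of known courts, and falls back to a close match to absorb small typos.

```python
import difflib
import re
from typing import Optional

# Hypothetical lookup table: cleaned court name -> canonical court ID.
# A real table would cover thousands of courts and historical name variants.
KNOWN_COURTS = {
    "supreme court of the united states": "scotus",
    "united states supreme court": "scotus",
    "united states court of appeals for the ninth circuit": "ca9",
    "supreme judicial court of massachusetts": "mass",
}


def clean(raw: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", raw.lower())
    return re.sub(r"\s+", " ", text).strip()


def normalize_court(raw: str, cutoff: float = 0.9) -> Optional[str]:
    """Return a canonical court ID, or None if no confident match exists."""
    name = clean(raw)
    if name in KNOWN_COURTS:
        return KNOWN_COURTS[name]
    # Fall back to a close match so that small typos still resolve.
    # The cutoff is an assumed value; a strict one avoids wrong matches.
    close = difflib.get_close_matches(name, list(KNOWN_COURTS), n=1, cutoff=cutoff)
    return KNOWN_COURTS[close[0]] if close else None


if __name__ == "__main__":
    for raw in [
        "Supreme Court of the United States.",
        "Supreme Judicial Court of Massachusets",  # typo in the source text
        "Court of Nowhere",
    ]:
        print(repr(raw), "->", normalize_court(raw))
```

In a sketch like this, names that return None would be set aside for a human editor rather than guessed at, which keeps the normalized field trustworthy.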
Editorial Cleanup
Because our data is used by many organizations and receives continual enhancement, we benefit from public error reports, which we fix from time to time. This flushes out errors in the source books, their extraction, or our own systems.
Additional Sources
Beyond this, the CourtListener data is a superset of the Harvard case law. CourtListener draws on a number of other sources that are merged with the Harvard data. For example, we've added millions of parallel citations, we regularly incorporate other sources of data, we scrape unpublished cases from across the web, and we are scanning books ourselves.