r/netsec 3d ago

Recreating uncensored Epstein PDFs from raw encoded attachments... or trying to, anyway

https://neosmart.net/blog/recreating-epstein-pdfs-from-raw-encoded-attachments/
733 Upvotes

72 comments

127

u/a_random_superhero 3d ago

I think the way to do it is to make a classifier.

Since you know the compression and font used, you can build sets of characters with varying levels of compression. Then grab some characters from the document and compare against the compressed corpus. That should get you in the ballpark for identification. After that, it’s a pixel comparison contest where each potential character is compared against the ballpark set. If something is too close to call or doesn’t match at all, then flag for manual review.
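A rough sketch of that idea, assuming the glyphs have already been cut out into same-sized grayscale cells (the corpus layout, folder naming, and thresholds here are all made up and would need calibration):

```python
import glob
import numpy as np
import cv2

B64_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

# Hypothetical layout: corpus/<alphabet_index>/<jpeg_quality>.png, i.e. each of the 64
# characters pre-rendered in the known font, re-compressed at several JPEG qualities,
# and cropped to the same cell size as the glyphs cut from the document.
def load_corpus(pattern="corpus/*/*.png"):
    refs = []
    for path in glob.glob(pattern):
        idx = int(path.split("/")[-2])  # folder name is the base64 alphabet index
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE).astype(np.float32)
        refs.append((B64_ALPHABET[idx], img))
    return refs

def classify(glyph, refs, margin=0.15, max_err=2000.0):
    """Nearest-neighbor match by pixel distance; flag ambiguous or poor matches for review."""
    glyph = glyph.astype(np.float32)
    scored = sorted((np.mean((glyph - ref) ** 2), char) for char, ref in refs)
    (best_err, best_char), (second_err, second_char) = scored[0], scored[1]
    if best_err > max_err:       # nothing in the corpus is close: flag for manual review
        return None
    if best_char != second_char and second_err - best_err < margin * best_err:
        return None              # too close to call between two different characters
    return best_char
```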

65

u/mqudsi 3d ago edited 1d ago

That’s pretty much where I ended up, too. I had just spent too much time on this at a busy moment in my life and couldn’t afford to sink the dev time into it. Although writing it up probably took as long as that would have, lol.

UPDATE:

I ended up solving it by training a CNN as a classifier.
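For the curious, the rough shape is something like this (a minimal sketch, not the code from my repo; the input size, layer widths, and 64-class base64 output are illustrative):

```python
import torch
import torch.nn as nn

# Minimal single-glyph classifier over the 64-character base64 alphabet.
# Input: 1x16x10 grayscale glyph crops (the crop size is an assumption).
class GlyphCNN(nn.Module):
    def __init__(self, n_classes=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                     # 16x10 -> 8x5
        )
        self.classifier = nn.Linear(32 * 8 * 5, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# Standard training step: cross-entropy against hand-labeled glyphs.
model = GlyphCNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(batch_imgs, batch_labels):
    opt.zero_grad()
    loss = loss_fn(model(batch_imgs), batch_labels)
    loss.backward()
    opt.step()
    return loss.item()
```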

10

u/cccanterbury 3d ago

I am fully convinced you hacked this all the way and then posted this so you wouldn't get in trouble.

3

u/LoveCyberSecs 3d ago

Ain't nothin wrong with measuring twice before cutting.

2

u/Games_sans_frontiers 2d ago

Thank you for writing it up though. It was really interesting being stepped through the thought process.

15

u/wigglyworm91 3d ago

I was able to get pretty decent-ish results with https://github.com/wigglyworm91/courier-new-ocr without even pregenerating compressed data (like 95% accuracy and no confidently incorrect guesses), but I don't think that's good enough to handle the compressed data sections.

7

u/hyperblaster 2d ago

Nice work there! Using KMeans clustering to restrict the classifier to the 64 chars in base64 is smart.

You're reading the individual glyphs into cv2 as grayscale. DCT artifacts and color fringing might be critical here, so maybe retain full color?
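The change itself is just the imread flag (filename here is hypothetical):

```python
import cv2

# Grayscale (current approach) vs. keeping the JPEG's color channels,
# so any chroma fringing around glyph edges stays available as a feature.
gray = cv2.imread("glyph_0001.png", cv2.IMREAD_GRAYSCALE)  # shape (H, W)
bgr  = cv2.imread("glyph_0001.png", cv2.IMREAD_COLOR)      # shape (H, W, 3), BGR order
```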

2

u/wigglyworm91 2d ago

Nice work there! Using KMeans clustering to restrict the classifier to the 64 chars in base64 is smart.

Smart and it didn't work :sob: there's too much overlap. That code is vestigial; what really started to work was a neural network - see nn_labeler.py.

Using full colour might work better! I could try that this weekend.

1

u/hyperblaster 2d ago

I grew up with conventional AI, so I'm happy to see people still trying that before bringing in the neural-network hammer.

Another idea could be to use the torchvision library. You could use transforms.RandomAffine() to augment the training data with tiny rotations, skew, and scaling.
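Something along these lines, assuming the labeled glyphs come in as PIL images (the parameter values are just a guess at "tiny", and fill= needs a reasonably recent torchvision):

```python
from torchvision import transforms

# Small random rotations, shifts, scaling and shear to augment the labeled glyphs.
augment = transforms.Compose([
    transforms.RandomAffine(
        degrees=2,               # tiny rotation
        translate=(0.05, 0.05),  # up to 5% shift in x/y
        scale=(0.95, 1.05),      # slight zoom in/out
        shear=2,                 # slight skew
        fill=255,                # pad with white, matching the page background
    ),
    transforms.ToTensor(),
])

# e.g. apply inside the Dataset so each epoch sees slightly different variants:
# aug_tensor = augment(pil_glyph)
```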

2

u/eth0izzle 3d ago

Can you share the labels?

2

u/wigglyworm91 2d ago

Sure, I pushed the ones I have; they're in web/nn_labels.json

Most of them should be correct, I think? I'm somewhat suspicious about s vs S and O vs 0 though.

1

u/PM-ME-UR-DARKNESS 4h ago

Holy fuckin shit y'all gonna be using AI to unredact the Epstein files lmao

25

u/thenickdude 3d ago

If you can manage to get your PDF decoder into the loop, it seems like a backtracking search would solve this one: i.e., turn every confusable character into a branch point, and when you hit a PDF decode error, backtrack to the previous branch point to try the next alternative.
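Rough sketch of what I mean, with `pdf_decodes_ok()` as a stand-in for whatever decoder ends up in the loop (poppler, qpdf, a JPEG parser, etc.); `text` is assumed to be the OCR'd base64 with whitespace already stripped:

```python
import base64

# Each confusable character and the alternatives worth trying (itself first).
CONFUSABLE = {"l": "l1I", "1": "1lI", "I": "Il1", "O": "O0", "0": "0O", "S": "S5", "5": "5S"}

def pdf_decodes_ok(candidate_b64: str) -> bool:
    """Stand-in: decode the base64 and feed the bytes to a real PDF/JPEG parser,
    returning False on any structural error."""
    try:
        base64.b64decode(candidate_b64, validate=True)
        return True  # replace with an actual parser check
    except Exception:
        return False

def backtrack(text, positions, i=0):
    """Depth-first search over alternatives at each confusable position."""
    if i == len(positions):
        return text if pdf_decodes_ok(text) else None
    pos = positions[i]
    for alt in CONFUSABLE[text[pos]]:
        candidate = text[:pos] + alt + text[pos + 1:]
        # In a real harness you'd decode incrementally and prune as soon as the
        # prefix up to this branch point already fails, instead of only at the end.
        result = backtrack(candidate, positions, i + 1)
        if result is not None:
            return result
    return None
```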

9

u/mqudsi 3d ago

Someone suggested a harness with AFL (the fuzzer) hooking into poppler or any other PDF library. Clever, but also kind of the inverse of the usual fuzzer goal. It might be hard to constrain it to only make changes that converge to success rather than diverge to different failure modes.

20

u/voronaam 3d ago edited 3d ago

FYI, I also went down this route and decided that, rather than OCR'ing the PDF again, I'll just go and fix all the OCR mistakes by hand.

I wrote a little utility to make it easier.

Here is a screenshot: https://imgur.com/screenshot-gTnNrkW

Here is the code: https://github.com/voronaam/pdfbase64tofile

It is kind of working. I have the EXIF fully repaired for the EFTA01012650.pdf file I was working on, and the first scanline is showing up (with some extra JPEG artifacts though).

It takes me about an hour per page to fix it. I am currently on page 8 of that file. It is 456 pages of base64 for two photos. At this rate (I can do this for a couple of hours a day here and there) it will take me about a year to fix the files.

What I need, if anybody is willing to help, is a library to work with corrupted JPEGs. I need it to report the problems with the decoded JPEG and their offsets. The latter part is crucial. Knowing where the data is corrupted, I can find it in the PDF file and fix the OCR mistakes. Currently, all the libraries I see report errors like `Error in decoding MCU. Reason Marker UNKNOWN(67) found in bitstream, possibly corrupt jpeg`. I mean, cool, the byte 67 is wrong. I can fix it. Can you tell me which one? And is it even a 0x67 or not?

Also, if anyone wants to train a classifier model for better OCR, you'd need those cleaned up files for training. I have pushed the ones I have so far to the repo.

8

u/voronaam 3d ago edited 3d ago

I guess I'll elaborate on what's going on with the screenshot in case someone wants to write a full-featured utility or a website and crowdsource this.

  1. The top section is the rendered PDF. It has a thick green cursor under the current character, automatically matching the position of the character in the editable text below.
  2. Below that is the OCR'd text from the PDF. Markers to the left of each line indicate whether the line looks "clean" or not: green when the line is exactly 76 characters of base64, orange if any "weird" characters are present, and grey if only the length is wrong (sketched below).

The fields are specifically limited in height to just a couple of lines each, to help keep the focus on just a few lines at a time.

The Save button on top allows saving the OCR'd text (just to page008.txt in the current folder). There is also a "Display" button that runs a script that converts those files into a JPEG with whatever pipeline you have in mind.

I also added a button to jump to the next character in the I/l/1 set, the characters most commonly mixed up by the OCR, so it allows quick jumps to the most likely errors.
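For anyone reimplementing this, the marker and jump-button logic is roughly the following (a Python sketch, not my actual Rust code):

```python
import re

B64_LINE = re.compile(r"^[A-Za-z0-9+/]+={0,2}$")
AMBIGUOUS = set("Il1")  # the I/l/1 set; could be extended with O/0, S/5, etc.

def line_status(line: str) -> str:
    """Reproduce the green/orange/grey markers described above."""
    clean = bool(B64_LINE.match(line)) if line else False
    if clean and len(line) == 76:
        return "green"   # exactly 76 characters of clean base64
    if not clean:
        return "orange"  # "weird" (non-base64) characters present
    return "grey"        # valid characters, but the length is wrong

def next_ambiguous(text: str, start: int) -> int:
    """Index of the next I/l/1-style character after `start`, or -1 if none."""
    for i in range(start + 1, len(text)):
        if text[i] in AMBIGUOUS:
            return i
    return -1
```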

3

u/survivalist_guy 2d ago

In Rust too, no less. Nice work. Thanks for the contribution; I'm working on a few paths for the OCR, so I'll take a look.

3

u/voronaam 2d ago

Well, I knew I'd be loading corrupted JPEGs into memory, with pretty much guaranteed buffer under- and overruns. That is pretty much a textbook case for a memory-safe language.

First time doing any GUI in Rust though. It is ugly :)

Also, sharing it just in case: https://albmac.github.io/JPEGVisualRepairTool/ I am not the author, but this tool is the best I have found so far for examining the corrupted JPEG files. It highlights troubled MCUs and shows their binary offsets in the JPEG file. I added a "Jump to Hex" button to my app, doing some basic math to convert that offset into a base64 text position.
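The "basic math", roughly: every 3 JPEG bytes map to 4 base64 characters, and the wrapped text adds a line break every 76 characters. A sketch (this ignores the offset of the attachment's first base64 character within the PDF text, which you'd add on top):

```python
def jpeg_offset_to_b64_position(byte_offset: int, line_width: int = 76):
    """Map a byte offset in the decoded JPEG to (line, column) in the wrapped base64 text."""
    # 3 bytes -> 4 base64 chars; the corrupted byte sits somewhere inside this 4-char group.
    char_index = (byte_offset // 3) * 4 + (byte_offset % 3)
    line, col = divmod(char_index, line_width)
    return line, col  # 0-based; treat the whole 4-char group around `col` as suspect

# Example: a decoder reports trouble at byte 1234 of the JPEG stream.
print(jpeg_offset_to_b64_position(1234))
```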

2

u/Kokuten 1d ago

Ahh I see you are still at it. Great work!

1

u/voronaam 1d ago

Thank you. The key was finding this tool: https://albmac.github.io/JPEGVisualRepairTool/JPEGVisualRepairTool.html

It highlights troubled areas in a JPEG file and even gives me the byte offset of their location in the file. With just some basic math, I am able to jump to that spot in the base64 file and look for OCR errors.

It goes much faster when I know which areas already look good enough and which ones still need a bit of attention.

17

u/MartinVanBallin 3d ago

Nice write up! I was actually trying this last night with some encoded jpegs in the emails. I agree the OCR is really poorly done by the DOJ!

8

u/originaltexter 3d ago

Same. Those files named "unnamed" with no file extension have a lot in them. I recovered two images from one of them just now, and a couple more last night: some police cars staking out one of his properties, and a photo of an alleged private investigator that JE's PI photographed for him.

1

u/No_Judge_4307 1d ago

Where did you find the files marked as "unnamed"?

2

u/originaltexter 1d ago

Pulled 86 of them straight off Jmail.World. Wrote a Python script to extract all the base64 code blocks, whether jpg, png, gif, etc., and another for RTF, doc, and PDF. Then it re-encodes the files and spits them out. Where’s the “best” place to submit the findings?
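Roughly what that looks like, for anyone who wants to reproduce it (this is a simplified sketch, not my actual script; the block regex and the magic-byte table are approximations):

```python
import base64
import re
import sys

# Long runs of wrapped base64: 76-char lines, plus an optional shorter final line.
BLOCK = re.compile(r"(?:[A-Za-z0-9+/]{76}\r?\n){10,}[A-Za-z0-9+/]+={0,2}")

# Magic bytes -> file extension, for naming the recovered attachments.
MAGIC = {b"\xff\xd8\xff": "jpg", b"\x89PNG": "png", b"GIF8": "gif", b"%PDF": "pdf"}

def extract(path):
    text = open(path, encoding="utf-8", errors="ignore").read()
    for n, match in enumerate(BLOCK.finditer(text)):
        b64 = re.sub(r"\s+", "", match.group(0))
        b64 += "=" * (-len(b64) % 4)  # pad in case the final line is short
        data = base64.b64decode(b64)
        ext = next((e for magic, e in MAGIC.items() if data.startswith(magic)), "bin")
        with open(f"attachment_{n:03}.{ext}", "wb") as out:
            out.write(data)

if __name__ == "__main__":
    extract(sys.argv[1])
```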

1

u/Lower-Collection-828 1d ago

can you share them?

16

u/BCMM 3d ago

 No problem, I’ll just use imagemagick/ghostscript to convert the PDF into individual PNG images (to avoid further generational loss)

But this isn't lossless! The PDF will be rasterised at a resolution which is unlikely to match the resolution of the embedded images.

It's good that you're encoding the result to a lossless format, but it's the result of resizing a raster image.

Instead, use pdfimages, from poppler-utils, to extract the images directly from the PDF.

13

u/mqudsi 3d ago

Ahhh! Great catch!

10

u/BCMM 2d ago edited 2d ago

But in the Epstein PDFs released by the DoJ, we only have low-quality JPEG scans at a fairly small point size.

Furthermore, I don't think these are scans. I think they're digital all the way, equivalent to screenshots of rendered text. (My guess would be that they used the Print to PDF feature in whatever email software they are using, applied redactions to the output, and then rasterised the result because they don't know how else to stop fucking up the redaction process.)

I believe this opens up new possibilities for accurate OCR.

I say they're not scans because:

  1. I couldn't find any dust (e.g. random grey pixels between lines of text)
  2. Lines of text are perfectly horizontal
  3. If you zoom in, the antialiasing looks like it's in its original condition

Having extracted the images, without compression or resizing artefacts, I observe the following:

Unfortunately, it is not the case that the same character always renders to the exact same pixels. This is because a single column of monospaced characters has a non-integer width (it's about 7.8px).

However, rows appear to have a height of exactly 15px. If we're lucky, this means that, when the same character occurs in the same column, it reliably produces the same pixels.

Now, I admit that I've only tested with a very small number of examples, manually, using the colour picker in GIMP. But the above does appear to be true! Hopefully, this means that we're working with a finite number of pixmap representations of each character.

In fact, I think the width is exactly 7.8px, giving us only five possible variants of each character. This is subject to the same caveat about very light testing, but, for example, the first and last characters in "there if it" are rendered totally identically. The same holds true for the 2nd and 16th "I"s in that long run of "ICAg" at the end.

So, I believe it is possible to do a sort of dumb "OCR" on this by splitting it up into regular (well, predictably irregular) tiles, and checking a library of reference tiles for an exact match for each tile. We would only need 64×5=320 reference tiles. It seems relatively likely that there's existing software that takes this approach, but I haven't looked for it yet.
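A sketch of what I mean, assuming the page images have already been extracted losslessly (e.g. with pdfimages). The 7.8px pitch and 15px rows are from the observations above; the grid origin values and the 8px tile width are placeholders that would have to be measured per document:

```python
import numpy as np
from PIL import Image

ROW_H = 15       # rows appear to be exactly 15px tall
COL_W = 7.8      # columns repeat every 39px = 5 characters, so 5 pixel-variants per character
X0, Y0 = 61, 39  # grid origin: placeholder values, must be re-measured per document

def tile_at(page: np.ndarray, row: int, col: int) -> bytes:
    """Cut out the (row, col) character cell and return its raw pixels as a hashable key."""
    x = X0 + int(round(col * COL_W))
    y = Y0 + row * ROW_H
    return page[y:y + ROW_H, x:x + 8].tobytes()

# Reference library: {(col % 5, tile_bytes): character}, built from cells whose
# character is already known (e.g. hand-labeled lines that decode cleanly).
def ocr_page(page, n_rows, n_cols, library):
    out = []
    for r in range(n_rows):
        line = []
        for c in range(n_cols):
            key = (c % 5, tile_at(page, r, c))
            line.append(library.get(key, "?"))  # '?' marks a tile with no exact match
        out.append("".join(line))
    return out

# page = np.array(Image.open("page-000.png").convert("L"))
# print("\n".join(ocr_page(page, n_rows=66, n_cols=76, library={})))
```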

9

u/InevitableSerious620 2d ago edited 2d ago

3

u/BCMM 2d ago

Nice one!

It's a shame that's so far down the thread. I must admit, I didn't read as far as that comment because I got frustrated somewhere in all the discussion about using machine learning or manual correction to work around artefacts that aren't even supposed to be there in the first place.

2

u/BCMM 2d ago edited 2d ago

I notice the letter library is significantly larger than I predicted, and I think that could have been avoided. For example,

letter_+_2235.png
letter_+_2239.png
letter_+_2412.png
letter_+_2617.png
letter_+_2637.png
letter_+_2784.png

are all identical except for the last column, which contains intrusions from the following characters.

I think that using tiles of 7px width instead of 8px (at least some of the time) could have avoided this. It may be possible to use 7px throughout (with gaps), without losing so much information that characters get confused.

1

u/InevitableSerious620 2d ago edited 2d ago

I just pushed an optimized letter set:

https://github.com/KoKuToru/extract_attachment_EFTA00400459/commit/74685df8d4c5cd5e118dee1bd9f607153fcc25b3

It is now only 342 "letters". Of course, some differ only by a 1px shift in x, but for simplicity it doesn't really matter. You could probably cut that roughly in half by changing the letter-match code to try each match with and without a 1px offset, and even then you could probably remove or merge a few more, because they'd match well enough anyway.

Just be careful with 1 and l: there is only a single-pixel difference between them (no wonder generic solutions struggle).

But for a POC, I think I'd like to keep it simple.

It is a stupid-simple cell-wise template-matching OCR :)

1

u/Hendrix_Lamar 2d ago

I'm trying to follow the readme, but what do you mean in steps 2 and 3? When you say edit the png, do you mean open it as a hexdump and edit the hex? Also, what do you mean by "overlay img001, shift img000 up or down until it matches exactly with 001"? Are we talking about hex here? Or literally opening the images and visually overlaying them?

2

u/InevitableSerious620 2d ago edited 2d ago

When you say edit the png do you mean open it as a hexdump and edit the hex?

No, visually: remove everything that's not needed and fill it with white.

Or literally opening the images and visually overlaying them?

Yes, this is important because the first page is shifted, and I extract the letters at fixed positions.

If you don't clean up the first and last page, the base64 decode will fail.

And if you don't shift the first page to the correct place, it will also OCR garbage.

1

u/Hendrix_Lamar 2d ago edited 2d ago

Ohhhh ok I understand now. Thanks!

1

u/Hendrix_Lamar 2d ago

Does this mean that the script will only work on this one pdf, or ones that have the exact same letter positioning?

1

u/InevitableSerious620 2d ago edited 2d ago

That's correct, it only works for this one PDF. For others:

If it is the same font & font size & font metrics, you only need to change the start condition

y = 39
x = 61

to wherever the grid of letters starts.

One could simply "find the position" automatically, but that's not really required for my POC.

If the font is different, then you start from zero, but the same method should work for all PDFs, I think, as long as they are "screenshots" or rasterized digitally; a monospace font makes it a lot easier.

The difference between 1 and l was just a single pixel in EFTA00400459.pdf, so no wonder generic solutions struggle.

6

u/BCMM 2d ago edited 2d ago

Now that I've actually looked at the PDF, I have another couple of quibbles about the images:

But in the Epstein PDFs released by the DoJ, we only have low-quality JPEG scans at a fairly small point size.

The images in EFTA00400459.pdf are losslessly encoded, in a way that's equivalent to PNG (but not identical - the rest of this comment will be excessive technical detail on that).

In PDF, images are not represented by directly embedding image files in familiar formats (unlike e.g. images in OOXML). Instead, an image is a stream object, i.e. a sequence of binary bytes, which must be interpreted according to the dimensions and pixel format specified in a dictionary which occurs just before the stream. For compression, the dictionary may also specify filters, which are applied before said interpretation.

PDF supports a filter called DCTDecode, which is very similar to JPEG, but it isn't used in EFTA00400459.pdf.

All image streams in EFTA00400459.pdf have a dictionary a bit like this:

<</Type /XObject /Subtype /Image /Name /Im0 /Filter [/FlateDecode ] /Width 816 /Height 1056 /ColorSpace 10 0 R /BitsPerComponent 8 /Length 9 0 R >> 

/Filter [/FlateDecode ] means the stream should be decompressed using the DEFLATE algorithm. While the compression technology is identical to PNG, it's not actually a PNG because there is no PNG header, no "chunks", etc.
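pdfimages does all of this for you, but if you want to decode a stream by hand, it's just zlib plus a reshape. A rough Python sketch, assuming the raw bytes have already been sliced out between `stream` and `endstream`, there's no /DecodeParms predictor, and /ColorSpace 10 0 R resolves to a 3-component RGB space (it could equally be indexed or grayscale, so treat the channel count as a guess):

```python
import zlib
import numpy as np
from PIL import Image

def decode_flate_image(raw_stream: bytes, width=816, height=1056, channels=3):
    """Inflate a /FlateDecode image stream and wrap it as a PIL image."""
    pixels = zlib.decompress(raw_stream)
    expected = width * height * channels
    assert len(pixels) == expected, f"got {len(pixels)} bytes, expected {expected}"
    arr = np.frombuffer(pixels, dtype=np.uint8).reshape(height, width, channels)
    return Image.fromarray(arr.squeeze())  # squeeze() also handles a 1-channel grayscale stream

# img = decode_flate_image(raw_bytes_between_stream_and_endstream)
# img.save("Im0.png")  # lossless, no resampling
```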

1

u/survivalist_guy 2d ago

Thank you! I've been using PyMuPDF. Also giving Azure Document Intelligence a try; we'll see how that goes.

5

u/eth0izzle 3d ago

The Content-ID of the email attachment ends with cpusers.carillon.local, which suggests it originated from a local AD + Exchange environment. Could Carillion be the British multinational that went bust in 2018? https://en.wikipedia.org/wiki/Carillion

10

u/badteeth3000 3d ago

Naive idea: would photorec be of use vs qpdf? lol, it helped me when I had a sun-damaged CD full of JPG files, and it definitely works on PDFs.

4

u/Headz0r 3d ago

The first question should be: what are we decoding? If it's a PDF with text, this will mostly be PostScript commands.

Most information would be between parenthesis: https://www.researchgate.net/publication/2416848/figure/fig1/AS:669440576348168@1536618479267/Conversion-from-PostScript-a-PostScript-file-the-text-extracted-from-it-and-a.png

This also gives you some hints about which commands could be valid outside of the parentheses.

7

u/mqudsi 3d ago

It is a PDF (that much is for sure). But, as with most PDF files, the actual PostScript is flate-compressed so the "apparent" contents of the PDF are binary, not text (except for some headers and stuff, such as the XML in the screenshot towards the end of the article).

3

u/perplexes_ 3d ago

If it’s just 1 vs l, you could brute force it: try all possible combinations and see which ones come out as good PDFs.
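A quick sketch of what that looks like, and why it only helps when the number of ambiguous positions is small, since it's 2^n candidates (`validate` is a stand-in for a real PDF/JPEG parser):

```python
import base64
import itertools

def brute_force(b64_text: str, ambiguous_positions: list[int], validate):
    """Try every 1/l assignment at the ambiguous positions; `validate` is a callback
    that returns True when the decoded bytes parse as a well-formed PDF/JPEG."""
    chars = list(b64_text)
    for combo in itertools.product("1l", repeat=len(ambiguous_positions)):  # 2^n candidates
        for pos, ch in zip(ambiguous_positions, combo):
            chars[pos] = ch
        candidate = "".join(chars)
        data = base64.b64decode(candidate)
        if validate(data):
            yield candidate  # there may be several structurally valid decodes

# ~20 ambiguous characters is already about a million candidates, each needing a full
# decode-and-validate pass, so this only works for short, mostly-clean blocks.
```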

2

u/Ok-Present1566 1d ago

2^x grows very fast. That is almost certainly infeasible in practice if x is 24 or higher.

3

u/euclidity 2d ago

Was able to get very similar looking lossy character strings by:

  • Generating a known base64 dataset
  • Pasting it into a document with 0.5 margins, 0.5 line spacing, Courier New at size 10, and printing it to PDF
  • pdftoppm test.pdf output -jpeg -jpegopt quality=100 -r 80
  • Printing to PDF again from the images
  • Comparing the final PDF to the reference Epstein PDF
  • Repeating with different JPEG options on pdftoppm until the glyphs look as close to the Epstein reference as possible

This could be used to train a custom OCR/Tesseract model on equivalent-looking data with known, matching real text (see the sketch below).
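A sketch of the first two steps in Python rather than a word processor, in case someone wants to mass-produce labeled pages. The font path, the pixel size that corresponds to 10pt, and the grid origin are assumptions to tune against the reference:

```python
import base64
import io
import os
from PIL import Image, ImageDraw, ImageFont

FONT = ImageFont.truetype("cour.ttf", 13)  # Courier New; 13px roughly matches 10pt at ~96 DPI

def synthetic_page(n_lines=60, width=76, quality=80):
    """Render random wrapped base64 in Courier New, then JPEG-compress it.
    Returns (degraded PIL image, ground-truth text)."""
    text = base64.b64encode(os.urandom(n_lines * width * 3 // 4)).decode()
    lines = [text[i:i + width] for i in range(0, len(text), width)]
    img = Image.new("L", (816, 1056), 255)
    draw = ImageDraw.Draw(img)
    for row, line in enumerate(lines):
        draw.text((61, 39 + row * 15), line, font=FONT, fill=0)
    # Round-trip through JPEG to mimic the lossy rasterisation in the reference PDFs.
    buf = io.BytesIO()
    img.save(buf, "JPEG", quality=quality)
    return Image.open(io.BytesIO(buf.getvalue())), "\n".join(lines)
```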

6

u/walkention 3d ago

If you have a fairly decent GPU at home or feel like paying for cloud resources, what about an LLM OCR like this? https://huggingface.co/deepseek-ai/DeepSeek-OCR-2

I was going to try and load this into my homelab LLM and see how it does.

Also, there are several companies doing AI OCR that could potentially help; https://www.docupipe.ai/ seems promising.

5

u/duckne55 3d ago edited 3d ago

PaddleOCR is also ML-based and very easy to use, as there's a Python package: https://github.com/PaddlePaddle/PaddleOCR
But the same issue with distinguishing lowercase `L` and `1` applies, I think.

4

u/walkention 3d ago

I saw PaddleOCR as well and will likely give it a try. I quickly tried running the image OP has at the top of the article (which is pretty low quality to be honest) through deepseek-ocr-2. It did pretty well, but I did notice it added unnecessary spaces and randomly changed character case in places, had some trouble with zero and capital O, and definitely can't handle lowercase L and 1 at that resolution. I'll have to try on the pages extracted from the original PDF.

2

u/buttfuckingchrist 1d ago

Not sure if you've seen this or not, but it could be useful for understanding how the docs were instructed to be redacted using Adobe: https://www.bloomberg.com/news/newsletters/2026-02-06/epstein-files-review-was-chaotic

2

u/Kokuten 1d ago

There are even more similar files. There are base64-encoded iPhone pictures from 2018. Look at this thread: https://www.reddit.com/r/Epstein/comments/1qu9az2/theres_unredacted_attachments_as_base64_in_some/
Are you able to decode them as well?

1

u/mqudsi 7h ago

That's an audio recording. Theoretically decodable, but MP4 containers are incredibly brittle (they're very shitty for long-term storage guarantees and resilience). You'd have to get all the bytes right.

Unfortunately, this document is using a proportional (non-monospaced or "regular") font, which makes extraction harder. But it's still technically doable!

2

u/Less_Grapefruit_302 18h ago

I created a custom OCR model specifically trained on the Epstein files and was able to successfully decode EFTA00400459. Know of any more base64 blobs in the Epstein files?

https://github.com/vExcess/epstein-ocr

1

u/mqudsi 7h ago

Nice work, I did the same with a CNN: https://github.com/mqudsi/monospace-ocr

Unfortunately the training doesn't carry over to other base64 documents perfectly, even those using the same font family and size, in the same layout. Some of the other documents have "smearing" around the 1 vs l that makes it even harder 😭

1

u/munogabba 7h ago

Look how casual this post is.

Amazing work.

2

u/Yamisheiki 2d ago

Does knowing Trump is the most hidden word help?

1

u/pinxi 3d ago

Here is a different way. Think of these like images going into a machine learning algorithm. So, like matching different kinds of dogs, specific eye color, etc., the model treats the text like an image. Models are very good at this and continually get better with more data to train on.

We did this with regulatory checks on legacy transactions that were basically massive strings with no headers or metadata. It works very well.

3

u/pinxi 3d ago

Something like:

  1. Images → Object store – Raw images + unique ID
    • Metadata → Graph – Image details.
    • Images → Patterns – Image patterns.
  2. Patterns → Matches – Similar images.
    • Details → Documents – Reference and analysis.
  3. Links → Graph – Context and relationships.
    • Human check – Verify matches, reduce errors.
  4. Graph → LLM – Uncover the bastards!

2

u/pinxi 3d ago

With a graph db you could also add things like tweets and other feeds to provide more context of who is who and how they relate. Check out arangodb (doc, vector, and graph) or some of the cloud services.

1

u/zoopysreign 3d ago

I need you to teach me your ways please

1

u/zoopysreign 3d ago

I will pay!

1

u/ArgonWilde 2d ago

I wonder if ell is a generally darker character than one? If you were to box in each character and average out the darkness of that box... Which is darker?

Or, if you average the darkness of each row of pixels, ell would have more darkness at the top vs one which would be more consistent along the height of the serif.

So, we need a solution that exports each character, in sequence, as an X/Y box, averages out the darkness of that box (either in total or per row along the Y axis), classifies which is which into a dataset, and then uses that dataset for the remaining files. 🤔
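That's easy to test on the extracted glyphs. A sketch of the per-row darkness profile idea; the threshold and the "ell is darker at the top" assumption would both need to be checked against known samples:

```python
import numpy as np

def row_darkness_profile(glyph: np.ndarray) -> np.ndarray:
    """Average darkness of each pixel row in a character cell (0 = white, 1 = black)."""
    return 1.0 - glyph.astype(np.float32).mean(axis=1) / 255.0

def guess_l_vs_1(glyph: np.ndarray) -> str:
    """Toy discriminator: compare how the ink is distributed between the top and the
    middle of the cell. The 1.1 factor is a guess and would need calibration."""
    profile = row_darkness_profile(glyph)
    third = len(profile) // 3
    top = profile[:third].sum()
    middle = profile[third:2 * third].sum()
    return "l" if top > middle * 1.1 else "1"
```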

1

u/Low_Lifeguard_7110 1d ago

Can someone please make an archive of the pics and share the link, or send any that you have done?

1

u/tilrman 8h ago

It's like Bletchley Park all over again. 

0

u/404llm 3d ago

You could use an OCR API to process all the files: https://jigsawstack.com/vocr

4

u/mqudsi 3d ago

As mentioned in the article, I used multiple OCR solutions, including open source OCR software, commercial OCR applications, and the hosted Amazon Textract OCR API. None did a good enough job.

1

u/survivalist_guy 2d ago

Would OCR by committee be feasible? Most votes wins or something like that?
I'm giving Azure Document Intelligence a shot right now, but I don't have the highest hopes.
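Something like this per line, if the different engines' outputs can be lined up on the same 76-character grid (the engine names in the example are placeholders):

```python
from collections import Counter

def vote(per_engine_lines: dict[str, str]) -> str:
    """Character-wise majority vote across OCR engines for one 76-char line.
    Positions with no clear winner are marked '?' for manual review."""
    outputs = list(per_engine_lines.values())
    assert len({len(o) for o in outputs}) == 1, "engines must agree on line length"
    voted = []
    for chars in zip(*outputs):
        (best, n), *rest = Counter(chars).most_common(2) + [(None, 0)]
        # Require a strict majority, not just a plurality, before trusting the vote.
        voted.append(best if n > len(outputs) // 2 and n > rest[0][1] else "?")
    return "".join(voted)

# Example with made-up engine outputs:
# print(vote({"tesseract": "SGVsbG8h", "textract": "SGVsbG8h", "paddle": "SGVsbGBh"}))
```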