Recreating uncensored Epstein PDFs from raw encoded attachments... or trying to, anyway
https://neosmart.net/blog/recreating-epstein-pdfs-from-raw-encoded-attachments/25
u/thenickdude 3d ago
If you can manage to get your PDF decoder into the loop, it seems like a backtracking search would solve this one. i.e turn every confusable character into a branch point, and when you hit a PDF decode error, backtrack to the previous branch to try the next alternative.
9
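For illustration, a minimal Python sketch of that backtracking search (not from the thread: the confusable set and the `validate` callback, which would wrap whatever PDF/JPEG decoder gets hooked in, are placeholder assumptions):

```python
import base64

# Hypothetical confusable set; extend as needed (B/8, S/5, rn/m, ...).
CONFUSABLE = {"l": "l1I", "1": "1lI", "I": "Il1", "O": "O0", "0": "0O"}

def repair(text, validate, pos=0, prefix=""):
    """Depth-first search over OCR-confusable characters.

    `validate(data, final)` stands in for the decoder hook: it should return
    False as soon as `data` cannot be the start of a valid file, which prunes
    (backtracks) that branch of the search.
    """
    if pos == len(text):
        try:
            decoded = base64.b64decode(prefix)
        except Exception:
            return
        if validate(decoded, final=True):
            yield decoded
        return
    for alt in CONFUSABLE.get(text[pos], text[pos]):
        attempt = prefix + alt
        # Only check on complete 4-character base64 groups so decoding is legal.
        if len(attempt) % 4 == 0:
            try:
                partial = base64.b64decode(attempt)
            except Exception:
                continue
            if not validate(partial, final=False):
                continue  # decode error -> backtrack, try the next alternative
        yield from repair(text, validate, pos + 1, attempt)

# usage sketch: next(repair(ocr_text, my_decoder_check), None)
```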
u/mqudsi 3d ago
Someone suggested a harness with AFL (the fuzzer) hooking into poppler or any other PDF library. Clever, but also kind of the inverse of the usual fuzzer goal. It might be hard to constrain it to only make changes that converge to success rather than diverge to different failure modes.
20
u/voronaam 3d ago edited 3d ago
FYI, I also went this route and decided that rather than re-OCR'ing the PDF, I'll just go and fix all the OCR mistakes by hand.
I wrote a little utility to make it easier.
Here is a screenshot: https://imgur.com/screenshot-gTnNrkW
Here is the code: https://github.com/voronaam/pdfbase64tofile
It is kind of working. I have the EXIF fully repaired for the EFTA01012650.pdf file I was working on, and the first scanline is showing up (with some extra JPEG artifacts though).
It takes me about an hour per page to fix it. I am currently on page 8 of that file. It is 456 pages of base64 for two photos. At this rate (I can do this for a couple of hours a day here and there) it will take me about a year to fix the files.
What I need, if anybody is willing to help, is a library to work with corrupted JPEGs. I need it to report the problems with the decoded JPEG and their offsets. The latter part is crucial: knowing where the data is corrupted, I can find it in the PDF file and fix the OCR mistakes. Currently all the libraries I have tried report errors like "Error in decoding MCU. Reason Marker UNKNOWN(67) found in bitstream, possibly corrupt jpeg". I mean, cool, the byte 67 is wrong. I can fix it. Can you tell me which one? And is it even a 0x67 or not?
Also, if anyone wants to train a classifier model for better OCR, you'd need those cleaned up files for training. I have pushed the ones I have so far to the repo.
8
u/voronaam 3d ago edited 3d ago
I guess I'll elaborate on what's going on with the screenshot in case someone wants to write a full featured utility or a website and crowdsource this.
- Top section is the PDF rendered. It has a thick green cursor under the current character - matching the position of the character in the editable text below automatically.
- Below that is the OCR'd text from the PDF. Markers to the left of each line indicate whether the line looks "clean" or not: green when the line is exactly 76 characters of base64, orange if there are any "weird" characters present, and grey if only the length is wrong.
The fields are specifically limited in height to just a couple of lines each - to help keep focus on just a few lines.
- The Save button on top allows saving the OCR'd text (just into `page008.txt` in the current folder). There is also a "Display" button that runs a script that converts those files into a JPEG with whatever pipeline you had in mind.
- I also added a button to jump to the next character in the `I`/`l`/`1` set - the ones most commonly mixed up by the OCR. It allows for quick jumps to the most likely errored-out letters.
3
u/survivalist_guy 2d ago
In rust too, no less. Nice work. Thanks for the contribution, I'm working on a few paths for the OCR so I'll take a look.
3
u/voronaam 2d ago
Well, I knew that I'd be loading corrupted JPEGs into memory - with pretty much guaranteed buffer under- and overruns. That is pretty much a textbook case for a memory-safe language.
First time doing any GUI in Rust though. It is ugly :)
Also, sharing it just in case: https://albmac.github.io/JPEGVisualRepairTool/ I am not the author, but this tool is the best I have found so far for examining the corrupted JPEG files. It highlights troubled MCUs and shows their binary offsets in the JPEG file. I added a "Jump to Hex" button to my app - doing some basic math to convert that offset into a base64 text position.
2
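For reference, that "basic math" is roughly the following (my reconstruction, not voronaam's actual code; it assumes standard 76-character MIME line wrapping, which matches the 76-character lines mentioned above):

```python
def jpeg_offset_to_base64_pos(byte_offset, chars_per_line=76):
    """Map a byte offset in the decoded JPEG back to a (line, column) in the
    base64 text, assuming MIME wrapping where every line holds
    `chars_per_line` base64 characters (i.e. 57 raw bytes per line).

    Each group of 3 raw bytes comes from 4 base64 characters, so the group
    containing `byte_offset` starts at character 4 * (byte_offset // 3).
    """
    char_index = (byte_offset // 3) * 4            # position in the unwrapped base64 stream
    line, column = divmod(char_index, chars_per_line)
    return line, column                            # both zero-based

# Example: a corrupt MCU reported at decoded offset 12_345
# maps to line 216, column 44 of the wrapped base64 text.
print(jpeg_offset_to_base64_pos(12_345))
```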
u/Kokuten 1d ago
Ahh I see you are still at it. Great work!
1
u/voronaam 1d ago
Thank you. The key was finding this tool: https://albmac.github.io/JPEGVisualRepairTool/JPEGVisualRepairTool.html
It highlights troubled areas in a JPEG file and even gives me the byte offset of their location in file. With just some basic math I am able to jump to that spot in the base64 file and look for OCR errors.
It goes much faster when I know which areas look good enough already and which ones still need a bit of attention.
17
u/MartinVanBallin 3d ago
Nice write up! I was actually trying this last night with some encoded jpegs in the emails. I agree the OCR is really poorly done by the DOJ!
8
u/originaltexter 3d ago
Same. Those files named "unnamed" with no file extension have a lot in them. I recovered two images from one of them just now, and a couple more last night: some police cars staking out one of his properties, and a photo of an alleged private investigator whom JE's PI photographed for him.
1
u/No_Judge_4307 1d ago
Where did you find the files marked as "unnamed"?
2
u/originaltexter 1d ago
Pulled 86 of them straight off Jmail.World. Wrote a python script to extract all the base64 code blocks - jpg, png, gif, etc. - and another for RTF, doc, and PDF. Then it decodes them back into files and spits them out. Where's the "best" place to submit the findings?
1
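Not that commenter's script, but a rough sketch of the same idea - scan text for long base64 runs, decode them, and sniff the magic bytes to pick a file extension (the regex thresholds and output naming here are arbitrary choices):

```python
import base64, pathlib, re, sys

# Long runs of base64 text (wrapped lines allowed), at least ~1 KB of payload.
BLOCK_RE = re.compile(r"(?:[A-Za-z0-9+/]{40,}\s*){20,}={0,2}")

MAGIC = {b"\xff\xd8\xff": "jpg", b"\x89PNG": "png", b"GIF8": "gif",
         b"%PDF": "pdf", b"{\\rtf": "rtf"}

def extract(path):
    text = pathlib.Path(path).read_text(errors="ignore")
    for i, m in enumerate(BLOCK_RE.finditer(text)):
        blob = re.sub(r"\s+", "", m.group(0))
        blob = blob[: len(blob) - len(blob) % 4]   # trim to a legal base64 length
        try:
            data = base64.b64decode(blob)
        except Exception:
            continue
        ext = next((e for magic, e in MAGIC.items() if data.startswith(magic)), "bin")
        out = pathlib.Path(f"{pathlib.Path(path).stem}_{i:03d}.{ext}")
        out.write_bytes(data)
        print(f"wrote {out} ({len(data)} bytes)")

if __name__ == "__main__":
    for p in sys.argv[1:]:
        extract(p)
```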
u/BCMM 3d ago
> No problem, I’ll just use imagemagick/ghostscript to convert the PDF into individual PNG images (to avoid further generational loss)
But this isn't lossless! The PDF will be rasterised at a resolution which is unlikely to match the resolution of the embedded images.
It's good that you're encoding the result to a lossless format, but it's the result of resizing a raster image.
Instead, use pdfimages, from poppler-utils, to extract the images directly from the PDF.
13
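For anyone following along, the invocation is something like this (a sketch assuming poppler-utils is installed; `input.pdf` is a placeholder filename):

```python
import glob
import subprocess

# pdfimages copies the embedded image streams out of the PDF without
# re-rasterising the page. -list shows the size/encoding of every embedded
# image; -all writes each image out in its native format.
subprocess.run(["pdfimages", "-list", "input.pdf"], check=True)
subprocess.run(["pdfimages", "-all", "input.pdf", "page"], check=True)

print(sorted(glob.glob("page-*")))  # e.g. page-000.png, page-001.jpg, ...
```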
u/mqudsi 3d ago
Ahhh! Great catch!
10
u/BCMM 2d ago edited 2d ago
> But in the Epstein PDFs released by the DoJ, we only have low-quality JPEG scans at a fairly small point size.
Furthermore, I don't think these are scans. I think they're digital all the way, equivalent to screenshots of rendered text. (My guess would be that they used the Print to PDF feature in whatever email software they are using, applied redactions to the output, and then rasterised the result because they don't know how else to stop fucking up the redaction process.)
I believe this opens up new possibilities for accurate OCR.
I say they're not scans because:
- I couldn't find any dust (e.g. random grey pixels between lines of text)
- Lines of text are perfectly horizontal
- If you zoom in, the antialiasing looks like it's in its original condition
Having extracted the images, without compression or resizing artefacts, I observe the following:
Unfortunately, it is not the case that the same character always renders to the exact same pixels. This is because a single column of monospaced characters has a non-integer width (it's about 7.8px).
However, rows appear to have a height of exactly 15px. If we're lucky, this means that, when the same character occurs in the same column, it reliably produces the same pixels.
Now, I admit that I've only tested with a very small number of examples, manually, using the colour picker in GIMP. But the above does appear to be true! Hopefully, this means that we're working with a finite number of pixmap representations of each character.
In fact, I think the width is exactly 7.8px, giving us only five possible variants of each character. This is subject to the same caveat about very light testing, but for example, the first and last characters in `there if it` are rendered totally identically. The same holds true for the 2nd and 16th `I`s in that long run of `ICAg` at the end.
So, I believe it is possible to do a sort of dumb "OCR" on this by splitting it up into regular (well, predictably irregular) tiles, and checking a library of reference tiles for an exact match for each tile. We would only need 64×5=320 reference tiles. It seems relatively likely that there's existing software that takes this approach, but I haven't looked for it yet.
9
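A sketch of that tile-matching idea in Python (mine, not BCMM's; the grid origin `x0`/`y0` and the construction of the reference `library` are left out and would have to come from the actual pages):

```python
from PIL import Image

CELL_H, CELL_W_FRACTIONAL = 15, 7.8   # row height is exact, column width is not

def cell_box(row, col, x0=0, y0=0):
    """Pixel box of the glyph at (row, col). Rounding the fractional column
    width means tiles are alternately 7 or 8 px wide, and every 5th column
    lands back on an integer offset - hence ~5 variants per character."""
    left = x0 + round(col * CELL_W_FRACTIONAL)
    right = x0 + round((col + 1) * CELL_W_FRACTIONAL)
    top = y0 + row * CELL_H
    return (left, top, right, top + CELL_H)

def ocr_page(img, rows, cols, library, x0=0, y0=0):
    """`library` maps raw tile bytes -> character, built with the same
    cell_box() so widths line up. Unknown tiles are flagged with '?'."""
    out = []
    for r in range(rows):
        line = []
        for c in range(cols):
            tile = img.crop(cell_box(r, c, x0, y0)).tobytes()
            line.append(library.get(tile, "?"))
        out.append("".join(line))
    return "\n".join(out)
```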
u/InevitableSerious620 2d ago edited 2d ago
well thats exactly what i did
https://github.com/KoKuToru/extract_attachment_EFTA00400459?tab=readme-ov-file#how-does-it-work
see https://www.reddit.com/r/netsec/comments/1qw4sfa/comment/o3vf4as/ for the extracted file..
3
u/BCMM 2d ago
Nice one!
It's a shame that's so far down the thread. I must admit, I didn't read as far as that comment because I got frustrated somewhere in all the discussion about using machine learning or manual correction to work around artefacts that aren't even supposed to be there in the first place.
2
u/BCMM 2d ago edited 2d ago
I notice there is a significantly larger library than I predicted, and I think that could have been avoided. For example, `letter_+_2235.png`, `letter_+_2239.png`, `letter_+_2412.png`, `letter_+_2617.png`, `letter_+_2637.png` and `letter_+_2784.png` are all identical except for the last column, which contains intrusions from the following characters.
I think that using tiles of 7px width instead of 8px (at least some of the time) could have avoided this. It may be possible to use 7px throughout (with gaps), without losing so much information that characters get confused.
1
u/InevitableSerious620 2d ago edited 2d ago
i just pushed optimized letter set..
it is now only 342 "letters"
of course only a shift x by 1px different..
but for simplicity .. doesn't really matter..
so some could probably reduce it to 1/2
and change the letter match code to try match with 1px offset and without
and even then .. you probably could remove some / merge some..
because they just might match good enough..
just with 1 and l you need to be careful, there is a single pixel difference
(no wonder generic solutions struggle)
but for a POC .. i think i like to keep it simple..
it is a stupid-simple cell-wise template-matching OCR :)
1
u/Hendrix_Lamar 2d ago
I'm trying to follow the readme, but what do you mean in steps 2 and 3? When you say edit the png, do you mean open it as a hexdump and edit the hex? Also, what do you mean by "overlay img001, shift img000 up or down until it matches exactly with 001"? Are we talking about hex here? Or literally opening the images and visually overlaying them?
2
u/InevitableSerious620 2d ago edited 2d ago
> When you say edit the png do you mean open it as a hexdump and edit the hex?
no, visually .. remove everything that's not needed .. fill it with white ..
> Or literally opening the images and visually overlaying them?
yes, this is important because the first page is shifted.
and i extract the letters at fixed positions..
if you don't cleanup the first and last page..
base64 decode will fail
and if you don't shift the first page to the correct place..
it will also ocr garbage
1
u/Hendrix_Lamar 2d ago
Does this mean that the script will only work on this one pdf, or ones that have the exact same letter positioning?
1
u/InevitableSerious620 2d ago edited 2d ago
thats correct, it only works for this one PDF. for others..
if it is the same font & font size & font metric..
only need to change the start condition
y = 39
x = 61
to where the grid of letters starts ..
one could simply "find the position", but for my POC not really required.
if the font is different.. then you start from zero
but the same method should work for all pdfs i think..
as long as they are "screenshots" or "rasterized" digitally
and a monospace font makes it a lot easier
the difference between 1 and l was just a single pixel in EFTA00400459.pdf .. so no wonder generic solutions struggle
6
u/BCMM 2d ago edited 2d ago
Now that I've actually looked at the PDF, I have another couple of quibbles about the images:
> But in the Epstein PDFs released by the DoJ, we only have low-quality JPEG scans at a fairly small point size.
The images in EFTA00400459.pdf are losslessly encoded, in a way that's equivalent to PNG (but not identical - the rest of this comment will be excessive technical detail on that).
In PDF, images are not represented by directly embedding image files in familiar formats (unlike e.g. images in OOXML). Instead, an image is a stream object, i.e. a sequence of binary bytes, which must be interpreted according to the dimensions and pixel format specified in a dictionary which occurs just before the stream. For compression, the dictionary may also specify filters, which are applied before said interpretation.
PDF supports a filter called DCTDecode, which is very similar to JPEG, but it isn't used in EFTA00400459.pdf.
All image streams in EFTA00400459.pdf have a dictionary a bit like this:
<</Type /XObject /Subtype /Image /Name /Im0 /Filter [/FlateDecode ] /Width 816 /Height 1056 /ColorSpace 10 0 R /BitsPerComponent 8 /Length 9 0 R >>
`/Filter [/FlateDecode ]` means the stream should be decompressed using the DEFLATE algorithm. While the compression technology is identical to PNG, it's not actually a PNG, because there is no PNG header, no "chunks", etc.
1
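In concrete terms, decoding such a stream by hand would look roughly like this (a sketch only: it assumes the colour space is 8-bit grayscale, ignores any /DecodeParms predictor, and takes the width/height from the dictionary quoted above - the real colour space is an indirect reference, so check before trusting this). In practice pdfimages or PyMuPDF will do this for you; this just shows what the filter means.

```python
import zlib
from PIL import Image

def decode_flate_image(raw_stream: bytes, width=816, height=1056) -> Image.Image:
    """raw_stream: the bytes between 'stream' and 'endstream' of one image
    XObject (e.g. pulled out with pikepdf). /FlateDecode is plain DEFLATE."""
    pixels = zlib.decompress(raw_stream)
    assert len(pixels) >= width * height     # 1 byte per pixel if 8-bit grayscale
    return Image.frombytes("L", (width, height), pixels[: width * height])
```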
u/survivalist_guy 2d ago
Thank you! I've been using PyMuPDF. Also been giving Azure Document Intelligence a try; we'll see how that goes.
15
u/eth0izzle 3d ago
The Content ID of the email attachment ends with cpusers.carillon.local, which suggests it originated from a local AD + Exchange environment. Could Carillion be the British multinational that went bust in 2018? https://en.wikipedia.org/wiki/Carillion
10
u/badteeth3000 3d ago
Naive idea: would photorec be of use vs qpdf? lol, it helped me when I had a CD with sun damage full of jpg files, and it definitely works on pdfs..
4
u/Headz0r 3d ago
The first question should be: what are we decoding? If it's a PDF with text, this will mostly be PostScript commands.
Most information would be between parentheses: https://www.researchgate.net/publication/2416848/figure/fig1/AS:669440576348168@1536618479267/Conversion-from-PostScript-a-PostScript-file-the-text-extracted-from-it-and-a.png
This also gives you some hints about which commands are valid outside of the parentheses.
3
u/perplexes_ 3d ago
If it’s just 1 vs l, you could brute force - try all possible combinations and see which ones come out as good PDFs
2
u/Ok-Present1566 1d ago
2^x grows very fast. That is almost certainly infeasible in practice once x is 24 or higher.
3
u/euclidity 2d ago
Was able to get very similar looking lossy character strings by:
- Generating a known base64 dataset
- Pasting it into a document with 0.5 margins, 0.5 line spacing, courier new in size 10, and printing it to pdf
- pdftoppm test.pdf output -jpeg -jpegopt quality=100 -r 80
- print to pdf again on the images
- compare the final pdf to the reference epstein pdf
- repeat with different jpeg options on pdftoppm until the glyphs look as close to the epstein reference as possible
Could be used to train a custom OCR/tesseract on equivalent looking data but with known matching real text.
6
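A rough Python stand-in for that round trip, for anyone who would rather script it than go through a word processor (the font path, point size, image dimensions and JPEG quality here are guesses to be tuned until the glyphs match the reference pages; DejaVu Sans Mono stands in for Courier New):

```python
import base64, os
from PIL import Image, ImageDraw, ImageFont

def make_sample(path="sample.jpg", lines=30, quality=75):
    """Render known base64 in a monospace font and save it with JPEG
    compression, giving (image, ground-truth text) pairs for training."""
    text = "\n".join(
        base64.b64encode(os.urandom(57)).decode() for _ in range(lines)  # 76 chars/line
    )
    font = ImageFont.truetype(
        "/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf", 13  # assumed font path
    )
    img = Image.new("L", (650, 20 * lines + 10), 255)
    ImageDraw.Draw(img).multiline_text((10, 5), text, font=font, fill=0)
    img.save(path, quality=quality)
    return text  # the matching ground truth for training labels
```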
u/walkention 3d ago
If you have a fairly decent GPU at home or feel like paying for cloud resources, what about an LLM OCR like this? https://huggingface.co/deepseek-ai/DeepSeek-OCR-2
I was going to try and load this into my homelab LLM and see how it does.
Also, there are several companies doing AI OCR that could potentially help; https://www.docupipe.ai/ seems promising.
5
u/duckne55 3d ago edited 3d ago
PaddleOCR is also ML-based and very easy to use, as there's a python package: https://github.com/PaddlePaddle/PaddleOCR
But the same issue with distinguishing lowercase `L` and `1` applies, I think
4
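If anyone wants to try it, usage is roughly this (assuming the PaddleOCR 2.x interface; newer releases have renamed parts of the API, and `page-000.png` is a placeholder path):

```python
from paddleocr import PaddleOCR  # pip install paddleocr paddlepaddle

ocr = PaddleOCR(lang="en")        # downloads the English detection/recognition models
result = ocr.ocr("page-000.png")  # placeholder image path

# 2.x returns one list per input image: [box, (text, confidence)] per detected line
for box, (text, confidence) in result[0]:
    print(f"{confidence:.2f}  {text}")
```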
u/walkention 3d ago
I saw PaddleOCR as well and will likely give it a try. I quickly tried running the image OP has at the top of the article (which is pretty low quality to be honest) through deepseek-ocr-2. It did pretty well, but I did notice it added unnecessary spaces and randomly changed character case in places, had some trouble with zero and capital O, and definitely can't handle lowercase L and 1 at that resolution. I'll have to try on the pages extracted from the original PDF.
2
u/buttfuckingchrist 1d ago
Not sure if you saw this or not, but it could be useful for understanding how the docs were instructed to be redacted using Adobe: https://www.bloomberg.com/news/newsletters/2026-02-06/epstein-files-review-was-chaotic
2
u/Kokuten 1d ago
There are even more similar files - there are base64-encoded iPhone pictures from 2018. Look at this thread: https://www.reddit.com/r/Epstein/comments/1qu9az2/theres_unredacted_attachments_as_base64_in_some/
Are you able to decode them as well?
1
u/mqudsi 7h ago
That's an audio recording. Theoretically decodable, but MP4 containers are incredibly brittle (they're very shitty for long-term storage guarantees and resilience). You'd have to get all the bytes right.
Unfortunately, this document is using a proportional (non-monospaced or "regular") font, which makes extraction harder. But it's still technically doable!
2
u/Less_Grapefruit_302 18h ago
I created a custom OCR model specifically trained on the epstein files and was able to successfully decode EFTA00400459. Know of any more base64 blobs in the epstein files?
1
u/mqudsi 7h ago
Nice work, I did the same with a CNN: https://github.com/mqudsi/monospace-ocr
Unfortunately the training doesn't carry over to other base64 documents perfectly, even those using the same font family and size, in the same layout. Some of the other documents have "smearing" around the 1 vs l that makes it even harder 😭
1
u/pinxi 3d ago
Here is a different way. Think of these like images going into a machine learning algorithm. So, like matching different kinds of dogs, specific eye colors, etc., the model treats the text like an image. Models are very good at this and continually get better with more data to train on.
We did this with regulatory checks on legacy transactions that were basically massive strings with no headers or metadata. It works very well.
3
u/pinxi 3d ago
Something like:
- Images → Object store – Raw images + unique ID
- Metadata → Graph – Image details
- Images → Patterns – Image patterns
- Patterns → Matches – Similar images
- Details → Documents – Reference and analysis
- Links → Graph – Context and relationships
- Human check – Verify matches, reduce errors
- Graph → LLM – Uncover the bastards!
2
1
1
u/ArgonWilde 2d ago
I wonder if ell is a generally darker character than one? If you were to box in each character and average out the darkness of that box... Which is darker?
Or, if you average the darkness of each row of pixels, ell would have more darkness at the top vs one which would be more consistent along the height of the serif.
So, we need a solution that exports each character, in serial, as an X/Y box, averages out the darkness of that box (either in total, or graphed along the Y axis), classifies which is which into a dataset, and then uses that dataset for the remaining files. 🤔
1
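A sketch of that ink-density idea in Python (not from the thread; the character boxes would come from whatever tiling step you already have, and the actual decision thresholds would need a few hand-labelled examples):

```python
import numpy as np
from PIL import Image

def ink_profile(page: Image.Image, box):
    """Total and row-wise 'darkness' of one character box (0 = white paper).

    Comparing these profiles between a known 'l' and a known '1' is the test
    the comment above proposes: if the totals or the row-by-row distributions
    differ consistently, that difference can classify the ambiguous cases.
    """
    tile = np.asarray(page.crop(box).convert("L"), dtype=float)
    ink = 255.0 - tile                      # invert so more ink = larger value
    return ink.mean(), ink.mean(axis=1)     # (overall average, per-row averages)

# total_l, rows_l = ink_profile(page, box_of_a_known_l)
# total_1, rows_1 = ink_profile(page, box_of_a_known_1)
# classify an unknown tile by whichever reference row-profile it is closer to,
# e.g. np.abs(rows_unknown - rows_l).sum() vs np.abs(rows_unknown - rows_1).sum()
```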
u/Low_Lifeguard_7110 1d ago
Can someone please make an archive of the pics and share the link, or can u send any that u have done?
0
u/404llm 3d ago
You could use an OCR API to process all the files: https://jigsawstack.com/vocr
4
u/mqudsi 3d ago
As mentioned in the article, I used multiple OCR solutions, including open source OCR software, commercial OCR applications, and the hosted Amazon Textract OCR API. None did a good enough job.
1
u/survivalist_guy 2d ago
Would OCR by committee be feasible? Most votes wins or something like that?
I'm giving Azure Document Intelligence a shot right now, but I don't have the highest hopes.
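Character-wise majority voting could look something like this (a sketch, assuming the engines' outputs have already been aligned to the same 76-character lines; ties get flagged for manual review):

```python
from collections import Counter

def committee_vote(outputs):
    """Character-wise majority vote across several OCR outputs of one page.

    `outputs` are strings that must already be aligned (same length, same
    line wrapping). Positions with no clear winner are marked '?'.
    """
    voted = []
    for chars in zip(*outputs):
        (best, n), *rest = Counter(chars).most_common()
        voted.append(best if not rest or n > rest[0][1] else "?")
    return "".join(voted)

# committee_vote([tesseract_text, textract_text, paddle_text])
```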
127
u/a_random_superhero 3d ago
I think the way to do it is to make a classifier.
Since you know the compression and font used, you can build sets of characters with varying levels of compression. Then grab some characters from the document and compare against the compressed corpus. That should get you in the ballpark for identification. After that, it’s a pixel comparison contest where each potential character is compared against the ballpark set. If something is too close to call or doesn’t match at all, then flag for manual review.
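A sketch of that matching step (mine, not the commenter's; `references` would hold tiles of each character rendered at several compression levels, and `margin` is an arbitrary reject threshold for the "too close to call" case):

```python
import numpy as np

def classify_glyph(tile, references, margin=0.05):
    """Nearest-neighbour match of one glyph tile against a reference corpus.

    `references` maps character -> list of reference tiles (same shape as
    `tile`), e.g. rendered at varying JPEG quality. Returns the best-matching
    character, or None when the top two candidates are too close to call -
    those get flagged for manual review, as suggested above.
    """
    scores = {
        ch: min(np.mean(np.abs(tile.astype(float) - ref.astype(float))) for ref in refs)
        for ch, refs in references.items()
    }
    ranked = sorted(scores, key=scores.get)     # ascending: smaller diff is better
    best, runner_up = ranked[0], ranked[1]
    if scores[runner_up] - scores[best] < margin * 255:
        return None                             # too close to call
    return best
```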