carom 3 days ago

Binary Ninja has an AI integration called Sidekick. It has a free trial, but I'm not sure it can be used in the free web version. [1]

In my experience, the off-the-shelf LLMs (e.g. ChatGPT) do a pretty poor job with assembly; they cannot reason about the stack or stack frames well.

I think your job will be the same with or without AI: figuring out the data structures and data types a function is operating on, and naming variables.

What are you reverse engineering for? For example, getting a full compilable decompilation has different goals than finding vulnerabilities or patching a bug.

1. https://sidekick.binary.ninja/

  • aidanhs 3 days ago

    Out of curiosity, what would you say the current state of the art is for full compilable decompilation? This is something I have a vague interest in but I'm not involved enough in the space to be on top of the latest and greatest tooling.

    • feznyng 3 days ago

      Echoing IDA but its pricing is a huge PITA if you’re using it in a hobbyist capacity i.e. you don’t have an employer willing to pay for it. Could opt for the home version but that’s a yearly cost and you have to use their cloud decompiler. Ghidra’s your best bet if you want something FOSS and community-driven although not as great at decompilation.

      • mdaniel 3 days ago

        Not only the pricing by itself, every story that I've heard about normal people trying to actually give them money is that they actually don't want to sell it to anyone other than big players

        That said, depending on one's needs they do actually offer a slimmed down IDA Free: https://hex-rays.com/ida-free

        I actually use AUR to more-or-less track its releases https://aur.archlinux.org/packages/ida-free

        • D4Ha a day ago

          Hexrays used to be difficult to deal with if you want to purchase IDA Pro for the first time, due to their software getting leaked online.

          They have eased the procedure to buy from them, but from time to time they'll ask you to fill out your info with national ID/passport (they say it's because they don't want to sell their software to individuals under sanctions). This is despite them being based in Belgium (not the US).

          For any serious work IDA Pro is highly suitable (the customization and scripting, loader examples, processor plugins, etc.). On the other hand, for side projects and basic security research, Binary Ninja and Ghidra can go a long way.

    • carom 3 days ago

      Most decompilers do not strive for recompilability. [1] I believe there are (or were) some academic projects that aimed for recompilation as a core feature, but it is a hard problem.

      On the commercial side, IDA / HexRays [2] is very strong for C-like decompilation. If you're looking at Go, Rust, or even C++ it is going to be a little bit more messy. As other commenters have said, you'll work function-by-function and it is expensive, though the free version does have decompilation (F5) for x86 and x64 (IIRC).

      Binary Ninja [3] (no affiliation) is the coolest IMO, they have multiple intermediate representations they lift the assembly through. So you get like assembly -> low level IL -> medium level IL -> high level IL. There are also SSA forms (static single assignment) that can aid in programmatic analyses. The high level IL is very readable but makes no effort to be compilable as a programming language. That being said, Binary Ninja has implemented different "views" on the HLIL so you can show it as pseudo-C, Rust, etc. There is a free online version and the commercial version is cheaper than IDA but still expensive. Good Python API, good UI.
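
      To make the SSA point concrete, here is a toy sketch in Python. It is not Binary Ninja's API: the tuple-based instruction format and the `to_ssa` name are made up for illustration, and it only handles straight-line code (no branches, so no phi nodes). It renames variables so that each one is assigned exactly once:

```python
def to_ssa(instructions):
    """Rename variables in straight-line code so each is assigned once.

    Each instruction is a hypothetical (dest, op, operands) tuple.
    There is no control flow here, so no phi nodes are needed.
    """
    version = {}  # variable name -> latest version number
    out = []

    def use(name):
        # Constants pass through; unseen variables are inputs (version 0).
        if not isinstance(name, str):
            return name
        return f"{name}#{version.get(name, 0)}"

    for dest, op, operands in instructions:
        renamed = tuple(use(o) for o in operands)  # uses see the old versions
        version[dest] = version.get(dest, 0) + 1   # a definition bumps the version
        out.append((f"{dest}#{version[dest]}", op, renamed))
    return out


program = [
    ("eax", "add", ("eax", 1)),
    ("ebx", "mov", ("eax",)),
    ("eax", "mul", ("eax", "ebx")),
]
ssa = to_ssa(program)
for dest, op, operands in ssa:
    print(dest, "=", op, operands)
```

      Once every definition has a unique name, questions like "which write does this read depend on?" become a simple lookup, which is why SSA forms help with programmatic analyses.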

      Ghidra [4] is the RE framework released by NSA. It is free and open source. It supports a ton of niche architectures. This is what most people use. I think the UI is awful, personally. It has a decompiler, the results are OK. They have an intermediate representation (P-Code) and plugins are in Java (since it is written in Java). I haven't worked much with it.

      Most online decompilations you see for old games are likely using Ghidra, some might be using IDA. This is largely a manual process of doing a function at a time and building up the mental map of the program and how things interact.

      Also worth mentioning are lifters. There were a few projects that aimed to lift assembly to LLVM IR (compiler framework's intermediate representation), with the idea being that then all your analyses could be written over LLVM IR as a lingua franca. Since it is in LLVM IR, it would be also recompilable and retargetable. [5][6]

      1. https://reverseengineering.stackexchange.com/questions/2603/...

      2. https://hex-rays.com/ida-free

      3. https://binary.ninja/free/

      4. https://ghidra-sre.org/

      5. https://github.com/avast/retdec

      6. https://github.com/lifting-bits/mcsema

    • Retr0id 3 days ago

      Looking at an individual function, IDA hex-rays output is often recompilable as-is (or with minor modifications), but it won't necessarily be idiomatic, especially if you don't have symbol information.

  • th0ma5 3 days ago

    This is what I gather from reverse engineering material I've read and groups I've been around. Hidden state, hidden data structures, hidden automations all abound, and there simply isn't enough detail in the assembler itself to bridge the hardware's internal conceptualization and processes.

JosephRedfern 3 days ago

These guys are building foundational models for this purpose: https://reveng.ai/. The results are quite compelling, and they have plugins for your favourite reverse engineering tools.

netsec_burn 3 days ago

I made a site to use LLMs to help me with reverse engineering. The output is surprisingly readable, even with C++ classes. Let me know any feedback you might have: https://decompiler.zeroday.engineering/

  • btown 3 days ago

    What kind of file should be uploaded?

    • netsec_burn 3 days ago

      The allowed types are a bit misleading. Any binary is accepted, any architecture. You can upload shared objects, ELF executables, PE binaries, etc.

      I like to give it bomb executables (reverse engineering challenges) to test it.

      • mdaniel 3 days ago

        > Any binary is accepted, any architecture.

        One should be careful tossing around the word "any" in relation to executable formats, for there are seemingly an unbounded number of them: https://github.com/1Password/onepassword-sdk-go/blob/v0.1.5/...

        Up to you, but currently your polling endpoint just has a boolean, which is likely super easy to cook on the server side but also leads the user left wondering "uh, is this thing on?" in ways that any kind of percentage might not. IOW, how long, exactly, should any sane person wait for it to be {"status":true}?
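
        As a rough sketch of what I mean (in Python, with made-up stage names and field names, since I don't know what the real pipeline looks like), the polling response could keep the existing boolean and add a stage and a percentage:

```python
import json

# Hypothetical pipeline stages; the real service's steps are unknown.
STAGES = ["queued", "disassembling", "decompiling", "summarizing", "done"]


def status_payload(stage_index):
    """Build a polling response that reports progress, not just a boolean."""
    last = len(STAGES) - 1
    return {
        "status": stage_index >= last,   # keeps the existing boolean
        "stage": STAGES[stage_index],    # human-readable current step
        "percent": round(100 * stage_index / last),
    }


def poll_until_done(fetch_status, max_polls=100):
    """Call fetch_status() repeatedly, collecting each response until done."""
    seen = []
    for _ in range(max_polls):
        payload = fetch_status()
        seen.append(payload)
        if payload["status"]:
            break
    return seen


# Simulated server that advances one stage per poll.
_responses = iter([status_payload(i) for i in range(len(STAGES))])
history = poll_until_done(lambda: next(_responses))
print(json.dumps(history[-1]))
```

        A real client would sleep between polls and give up after a timeout; the point is only that the user can see something moving.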

        Also, you have your ELB misconfigured, because trying to upload a binary that takes more than 30 seconds to upload causes the actual POST to puke. I'm sure that's great for hello-world.exe but is absolutely hilarious for any real binary

__alexander 3 days ago

Do you have experience reverse engineering? If not, LLMs are not going to help much. LLMs are useful for aiding the analysis but they don’t do the analysis.

  • uncomplexity_ 3 days ago

    Yea this one. If you have solid fundamentals these LLMs are really handy in assisting and never leading.

    For example I have a minified javascript file, way obfuscated. I can paste the code and make it break down the initial structure. And then I tell it which parts to focus on and which parts to dig in deeper.

lumb63 3 days ago

It has nothing to do with LLMs, but Ghidra is a wonderful tool.

mahaloz 2 days ago

I like using it for library function comments, variable name recovery, and sometimes types. The comments are usually hit or miss, but I find the variable names to be a bit better than auto-generated ones. I implement most of this in my decompiler plugin: https://github.com/mahaloz/DAILA; check it out if you are interested :).

Dwedit 3 days ago

Have you tried Ghidra yet? If you still have your debug symbols, then it can do a really good job.

flashgordon 3 days ago

Interesting. Wouldn't this actually be a deterministic problem based on graph analysis? I'd have thought LLMs would be more effective taking the output of some graph recognizer and then identifying what those higher-level constructs map to.

  • warkdarrior 3 days ago

    Deterministic maybe, but surely undecidable in the general case since you need whole program analysis to understand, for example, the purpose of a memory location. ML may help approximate this undecidable problem.

stackghost 3 days ago

The Advent of Cyber side quest this year needed some Ghidra and I found Pickman's Model was pretty good at helping me craft a heap exploit from a decompilation.

userbinator 2 days ago

Unfortunately LLMs are not good at precision and details, which is exactly what you need for the sort of analysis you're trying to do.

  • menaerus a day ago

    Right. Have a look at the paper above from Meta on how they fine-tuned the Code Llama with LLVM IR to beat the compiler in producing size-optimized binaries.

apatheticonion 3 days ago

Inspired by the work out there that reverse engineers game engines, I've always wanted to try my hand at reverse engineering to contribute to the world of game preservation.

Is it actually legal to decompile a game engine from executables/dll files, write new sources by making sense of the output and rewriting it such that it can be compiled targeting modern APIs?

I feel like that must be illegal

feznyng 3 days ago

You could use the LLM to help you write utility scripts for whatever disassembler you’re using e.g. python for IDA. That might work better than feeding it raw assembly.

Game RE communities also have all sorts of neat utilities for decompiling large cpp binaries. Skyrim’s community is pretty active with ghidra/ida.

Guessing you’re not lucky enough to have a PDB?

  • tonetegeatinst 3 days ago

    PDB?

    • feznyng 3 days ago

      Program database file - only relevant if the binary is Windows. But that makes decomp an order of magnitude easier. I’d be surprised if OP had one though.

svilen_dobrev 2 days ago

cpp? that's a preprocessor. u mean c++?

LLM won't help you much if u can't understand what it's talking about.

Manual way is, given ELF (linux executable format) somexe,

$ strings somexe

$ objdump -d somexe

$ objdump -s -j .rodata somexe

then look+ponder over the results.
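
If you want to script the first step, here is a minimal Python sketch of roughly what strings does by default: pull out runs of at least 4 printable ASCII bytes. The `extract_strings` name and the sample bytes are made up for illustration, and the real tool handles more cases (other encodings, tabs, section selection).

```python
import re


def extract_strings(data: bytes, min_len: int = 4):
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


# Made-up bytes standing in for a slice of a binary's read-only data.
blob = b"\x7fELF\x02\x01\x00\x00Enter password: \x00\x01\x02ok\x00/bin/sh\x00"
found = extract_strings(blob)
print(found)
```

On a real file you would pass in `open("somexe", "rb").read()` instead of the sample bytes.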

and/or running ghidra (as mouse'd UI) over it.. which may help somewhat but not 100%

Keep in mind that objdump and ghidra show multi-operand instructions in opposite operand order on x86: objdump defaults to AT&T syntax (mov source,dest), while ghidra uses Intel-style syntax (mov dest,source), for the same code.
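
A tiny Python sketch of that operand swap, assuming x86 and only plain register or immediate operands (the `att_to_intel` helper is hypothetical; memory operands and size suffixes like movq are out of scope):

```python
def att_to_intel(line: str) -> str:
    """Convert a simple two-operand AT&T line to Intel operand order.

    Only handles plain register/immediate operands such as
    "mov %rsp,%rbp"; anything more complex is not supported.
    """
    mnemonic, operands = line.split(None, 1)
    # AT&T lists the source first; strip the % (register) and $ (immediate) sigils.
    src, dst = (o.strip().lstrip("%$") for o in operands.split(","))
    return f"{mnemonic} {dst},{src}"


print(att_to_intel("mov %rsp,%rbp"))
```

So the same instruction that reads as a move from rsp in one listing shows rbp first in the other.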

no idea on (recent) windoze front. IDA ?

sitkack 3 days ago

Do you know the compiler and what the source possibly looks like? I found LLMs are pretty good at recovering code from binaries, they need help though.

If you are able to run the program and collect traces, that will help a ton.

u53rn4m3 3 days ago

RevEng.AI have their own foundational AI models for decompilation with English language summaries.

seba_dos1 3 days ago

Good luck. If that's how you're approaching it, you're going to need it.

  • 2-3-7-43-1807 3 days ago

    op apparently never even heard about reddit

ianhawes 3 days ago

Highly recommend it. I reversed an app with o1 Pro Mode and the analysis of the obfuscated C# code matched up accurately with what I eventually discovered by manually reversing.

  • chc4 3 days ago

    Reverse engineering C# is extremely different from C++ binaries.