Saturday, April 25, 2026

Using joern-scan and Gemini on PROJ

 I have been meaning to take a look more at knowledge graphs for some work and I've been trying to figure out what useful tasks I can do along the way. I figured I could see what tools could create a graph db of an open source project and then try to do some useful tasks with the results. I think my first try did not accomplish anything useful.

Gemini claimed that https://github.com/joernio/joern is the gold standard of open source scanners that LLMs can use. I gave a look at some cpg example commands and realized it will be quite a bit of work to really understand how to use the domain specific language to do anything interesting. I figured I'd give it a try with https://github.com/OSGeo/PROJ/ to see what the gemini-cli can do. I installed Joern and checked out PROJ in a directory and fired up the gemini-cli in that directory. My first prompt:

Use joern and the cpg to search PROJ for bugs

Gemini went off and found joern-scan in the joern-cli directory and started running it. The tool created a cpg database for the project:

tree workspace/PROJ

workspace/PROJ

├── cpg.bin
├── cpg.bin.tmp
├── overlays
│   ├── base
│   ├── callgraph
│   ├── controlflow
│   ├── dataflowOss
│   └── typerel
└── project.json

 It then ran a scan for quite a while printing all sorts of complaints along the way.

head -20 /home/schwehr/.gemini/tmp/joern/tool-outputs/session-.../run_shell_command.txt
Output: Writing logs to: /tmp/joern-scan-log.txt
=======================================================================================================
Invoking CPG generator in a separate process. Note that the new process will consume additional memory.
If you are importing a large codebase (and/or running into memory issues), please try the following:
1) exit joern
2) invoke the frontend: /home/schwehr/joern/joern-cli/c2cpg.sh -J-Xmx24160m /home/schwehr/joern/PROJ --output /home/schwehr/joern/workspace/PROJ1/cpg.bin.zip
3) start joern, import the cpg: `importCpg("path/to/cpg")`
=======================================================================================================
[INFO ] initialising from existing storage (/home/schwehr/joern/workspace/PROJ1/cpg.bin.tmp)
[INFO ] Start of pass: io.joern.x2cpg.passes.base.FileCreationPass
[INFO ] Pass io.joern.x2cpg.passes.base.FileCreationPass completed in 95 ms (62% on mutations). 17643 + 0 changes committed from 1 parts.
[INFO ] Start of pass: io.joern.x2cpg.passes.base.NamespaceCreator
[INFO ] Pass io.joern.x2cpg.passes.base.NamespaceCreator completed in 16 ms (53% on mutations). 710 + 0 changes committed from 1 parts.
[INFO ] Start of pass: io.joern.x2cpg.passes.base.TypeDeclStubCreator
[INFO ] Pass io.joern.x2cpg.passes.base.TypeDeclStubCreator completed in 27 ms (11% on mutations). 0 + 0 changes committed from 1 parts.
[INFO ] Start of pass: io.joern.x2cpg.passes.base.MethodStubCreator
[INFO ] Inconsistent/erroneous callInfo on calls to method fullname osgeo.proj.datum.Datum.conventionalRS:ANY() (we have 1 many variants)
[INFO ] Inconsistent/erroneous callInfo on calls to method fullname osgeo.proj.common.IdentifiedObject.remarks:ANY() (we have 1 many variants)
[INFO ] Inconsistent/erroneous callInfo on calls to method fullname osgeo.proj.util.BaseObject.shared_from_this:osgeo.proj.util.BaseObjectNNPtr() (we have 1 many variants)

 

tail /run_shell_command.txt
Result: 5.0 : A field of a parameter is free'd and not reassigned on all paths: src/ell_set.cpp:89:pj_ellipsoid:int(PJ*)
Result: 5.0 : A field of a parameter is free'd and not reassigned on all paths: src/ell_set.cpp:203:ellps_size:int(PJ*)
Result: 5.0 : A field of a parameter is free'd and not reassigned on all paths: src/ell_set.cpp:252:ellps_shape:int(PJ*)
Result: 5.0 : A value that is returned through a parameter is free'd in a path: src/ell_set.cpp:175:ellps_ellps:int(PJ*)
[INFO ] Start of pass: io.shiftleft.semanticcpg.Overlays$$anon$1
[INFO ] Pass io.shiftleft.semanticcpg.Overlays$$anon$1 completed in 1 ms (81% on mutations). 1 + 0 changes committed from 1 parts.
[INFO ] writing to storage at `home/schwehr/joern/workspace/PROJ1/cpg.bin.tmp`
[INFO ] closed graph at /home/schwehr/joern/workspace/PROJ1/cpg.bin.tmp
Run `joern --for-input-path PROJ` to explore interactively 

Gemini printed a summary of what it found after a few minutes of work. I asked it to write a summary:

Write the summary as security-scan-summary.md

Which got me:

### 1. Memory Management & Freeing Issues (Severity: High)
A field of a parameter or returned value is `free`'d but not properly reassigned on all execution paths, which could lead to Use-After-Free (UAF) or Double-Free vulnerabilities:
* `src/ell_set.cpp:83`, `85`, `87`, `89` in `pj_ellipsoid:int(PJ*)`
* `src/ell_set.cpp:203` in `ellps_size:int(PJ*)`
* `src/ell_set.cpp:252` in `ellps_shape:int(PJ*)`
* `src/ell_set.cpp:175` in `ellps_ellps:int(PJ*)`

### 2. Format String Vulnerabilities (Severity: High)
A non-constant format string is passed to `printf`, `sprintf`, or `vsprintf`, which can be exploited if the format string contains un-sanitized user input:
* `src/apps/cct.cpp:246` in `main:int(int,char**)`
* `src/apps/gie.cpp:294` in `main<duplicate>2:int(int,char**)`

### 3. Unchecked Read/Recv/Malloc (Severity: Medium)
The return value of memory allocation functions (like `malloc`) or file reading functions does not appear to be checked for `NULL` or error codes before being used:
* `src/apps/cct.cpp:331` in `main:int(int,char**)`
* `src/apps/projinfo_lib.cpp:353` in `buildObject`
* `src/initcache.cpp:50` in `pj_clone_paralist`
* `src/initcache.cpp:148`, `157`, `170` in `pj_insert_initcache`

### 4. Dangerous Functions (`strcpy` / `strncpy`) (Severity: Medium)
Use of unsafe string copy functions that are prone to buffer overflows. These should be replaced with safer alternatives like `strlcpy` or `snprintf`:
* **`src/apps/gie.cpp`**: Lines 637, 756, 757, 762, 774, 1589
* **`src/datum_set.cpp`**: Lines 84, 85
* **`src/filemanager.cpp`**: Lines 1674, 1913
* **`src/grids.cpp`**: Multiple instances (Lines 3946, 3956, 3997, 4032, 4050, 4067, 4083, etc.)
* **`src/init.cpp`**: Lines 159, 194, 200, 244, 271, 289, 555, 644
* **`src/initcache.cpp`**: Lines 55, 172
* **`src/pr_list.cpp`**: Line 86
* **`src/param.cpp`**: Line 23

### 5. String Length Truncation (Severity: Low-Medium)
Assignments involving `strlen()` calls where truncation might occur due to integer type mismatches or unsafe size conversions:
* `src/iso19111/factory.cpp:7959`, `7961`, `7963`
* `src/iso19111/io.cpp:1655` and `6385`
* `src/pr_list.cpp:18` and `79`
* `src/apps/optargpm.h:532`

### 6. File Operation Race Conditions (Severity: Low)
Two file operations on the same path can act on different files (Time-of-Check to Time-of-Use):
* `src/filemanager.cpp:994`, `1002` in `FileManager.open`
* `src/grids.cpp:2698`, `2709`, `2722` in `HorizontalShiftGridSet.open`
* `src/grids.cpp:3974`, `3987` in `proj_grid_info`
* `src/networkfilemanager.cpp`: Multiple lines (e.g. `2541`, `2717`, `2819`) in `proj_download_file` and `proj_is_download_needed`. 

I did some quick inspection of the "Memory Management & Freeing Issues" listings and was not impressed. They were all set to nullptr right after being freed. I asked gemini and it basically said the same thing.

Look closer at the issues in src/ell_set.cpp 

I then asked Gemini to go through them all:

Go through each of the issues listed in @security-scan-summary.md. For each issue, do an evaluation. For issue that appears to be valid or possibly valid, create a markdown report in the issues directory. The report should include a detailed description of the issue. Explain the risk level of the issue. And give a proposed patch for a fix.

Gemini wrote one issue file:

# Unchecked Memory Allocations in `src/initcache.cpp`
## Detailed Description
In `src/initcache.cpp`, there are multiple instances where memory is allocated using `malloc`, but the return value is not gracefully checked for a `NULL` pointer. Instead, the code relies on `assert()` to enforce that the pointer is valid.
For example, in `pj_clone_paralist` (line ~50):
```cpp
        paralist *newitem =
            (paralist *)malloc(sizeof(paralist) + strlen(list->param));
        assert(newitem);
        newitem->used = 0;
```
And similarly in `pj_insert_initcache` (lines ~148, 157, 170):
```cpp
        cache_key_new = (char **)malloc(sizeof(char *) * cache_alloc);
        assert(cache_key_new);
        // ...
        cache_paralist_new =
            (paralist **)malloc(sizeof(paralist *) * cache_alloc);
        assert(cache_paralist_new);
        // ...
        cache_key[cache_count] = (char *)malloc(strlen(filekey) + 1);
        assert(cache_key[cache_count]);
```
The issue with using `assert()` here is that `assert` statements are typically compiled out in production builds (when `NDEBUG` is defined). In production, if `malloc` fails due to memory exhaustion, the pointers will be `NULL`. The subsequent dereferences (like `newitem->used = 0;`) will cause an immediate segmentation fault, crashing the host application.
## Risk Level
**Medium**
Out-of-memory (OOM) conditions are relatively rare, but robust libraries should handle allocation failures gracefully without crashing the host application. If an attacker can trigger excessive caching, they could theoretically trigger a denial-of-service (DoS) condition by crashing the process.
## Proposed Patch
Replace `assert()` with a proper `if` condition that handles the `NULL` pointer by cleaning up and returning an error or `nullptr`.
```patch
--- a/src/initcache.cpp
+++ b/src/initcache.cpp
@@ -48,7 +48,15 @@
     for (; list != nullptr; list = list->next) {
         paralist *newitem =
             (paralist *)malloc(sizeof(paralist) + strlen(list->param));
-        assert(newitem);
+        if (newitem == nullptr) {
+            // Free already copied elements
+            while (list_copy != nullptr) {
+                paralist *tmp = list_copy->next;
+                free(list_copy);
+                list_copy = tmp;
+            }
+            return nullptr;
+        }
 
         newitem->used = 0;
         newitem->next = nullptr;
@@ -145,14 +153,16 @@
         cache_alloc = cache_alloc * 2 + 15;
 
         cache_key_new = (char **)malloc(sizeof(char *) * cache_alloc);
-        assert(cache_key_new);
+        if (cache_key_new == nullptr) return; // Silent failure on cache insert
+        
         if (cache_key && cache_count) {
             memcpy(cache_key_new, cache_key, sizeof(char *) * cache_count);
         }
         free(cache_key);
         cache_key = cache_key_new;
 
         cache_paralist_new =
             (paralist **)malloc(sizeof(paralist *) * cache_alloc);
-        assert(cache_paralist_new);
+        if (cache_paralist_new == nullptr) return;
+        
         if (cache_paralist && cache_count) {
             memcpy(cache_paralist_new, cache_paralist,
                    sizeof(paralist *) * cache_count);
@@ -167,7 +177,8 @@
     ** Duplicate the filekey and paralist, and insert in cache.
     */
     cache_key[cache_count] = (char *)malloc(strlen(filekey) + 1);
-    assert(cache_key[cache_count]);
+    if (cache_key[cache_count] == nullptr) return;
+    
     strcpy(cache_key[cache_count], filekey);
 
     cache_paralist[cache_count] = pj_clone_paralist(list);
```

I rate that report as not very interesting. Gemini's conclusion:



So all that and no real value added. Now I have to wonder what non-security check things can be done with the cpg file? Can it aid an LLM cli in making normal code changes? I'm thinking things like adding new features, improving performance, refactoring, checking numerical stability, etc.?

No comments:

Post a Comment