I have been meaning to take a look more at knowledge graphs for some work and I've been trying to figure out what useful tasks I can do along the way. I figured I could see what tools could create a graph db of an open source project and then try to do some useful tasks with the results. I think my first try did not accomplish anything useful.
Gemini claimed that https://github.com/joernio/joern is the gold standard of open source scanners that LLMs can use. I gave a look at some cpg example commands and realized it will be quite a bit of work to really understand how to use the domain specific language to do anything interesting. I figured I'd give it a try with https://github.com/OSGeo/PROJ/ to see what the gemini-cli can do. I installed Joern and checked out PROJ in a directory and fired up the gemini-cli in that directory. My first prompt:
Use joern and the cpg to search PROJ for bugs
Gemini went off and found joern-scan in the joern-cli directory and started running it. The tool created a cpg database for the project:
tree workspace/PROJ
workspace/PROJ
├── cpg.bin├── cpg.bin.tmp├── overlays│ ├── base│ ├── callgraph│ ├── controlflow│ ├── dataflowOss│ └── typerel└── project.json
It then ran a scan for quite a while printing all sorts of complaints along the way.
head -20 /home/schwehr/.gemini/tmp/joern/tool-outputs/session-.../run_shell_command.txtOutput: Writing logs to: /tmp/joern-scan-log.txt=======================================================================================================Invoking CPG generator in a separate process. Note that the new process will consume additional memory.If you are importing a large codebase (and/or running into memory issues), please try the following:1) exit joern2) invoke the frontend: /home/schwehr/joern/joern-cli/c2cpg.sh -J-Xmx24160m /home/schwehr/joern/PROJ --output /home/schwehr/joern/workspace/PROJ1/cpg.bin.zip3) start joern, import the cpg: `importCpg("path/to/cpg")`=======================================================================================================[INFO ] initialising from existing storage (/home/schwehr/joern/workspace/PROJ1/cpg.bin.tmp)[INFO ] Start of pass: io.joern.x2cpg.passes.base.FileCreationPass[INFO ] Pass io.joern.x2cpg.passes.base.FileCreationPass completed in 95 ms (62% on mutations). 17643 + 0 changes committed from 1 parts.[INFO ] Start of pass: io.joern.x2cpg.passes.base.NamespaceCreator[INFO ] Pass io.joern.x2cpg.passes.base.NamespaceCreator completed in 16 ms (53% on mutations). 710 + 0 changes committed from 1 parts.[INFO ] Start of pass: io.joern.x2cpg.passes.base.TypeDeclStubCreator[INFO ] Pass io.joern.x2cpg.passes.base.TypeDeclStubCreator completed in 27 ms (11% on mutations). 0 + 0 changes committed from 1 parts.[INFO ] Start of pass: io.joern.x2cpg.passes.base.MethodStubCreator[INFO ] Inconsistent/erroneous callInfo on calls to method fullname osgeo.proj.datum.Datum.conventionalRS:ANY() (we have 1 many variants)[INFO ] Inconsistent/erroneous callInfo on calls to method fullname osgeo.proj.common.IdentifiedObject.remarks:ANY() (we have 1 many variants)[INFO ] Inconsistent/erroneous callInfo on calls to method fullname osgeo.proj.util.BaseObject.shared_from_this:osgeo.proj.util.BaseObjectNNPtr() (we have 1 many variants)
tail /run_shell_command.txtResult: 5.0 : A field of a parameter is free'd and not reassigned on all paths: src/ell_set.cpp:89:pj_ellipsoid:int(PJ*)Result: 5.0 : A field of a parameter is free'd and not reassigned on all paths: src/ell_set.cpp:203:ellps_size:int(PJ*)Result: 5.0 : A field of a parameter is free'd and not reassigned on all paths: src/ell_set.cpp:252:ellps_shape:int(PJ*)Result: 5.0 : A value that is returned through a parameter is free'd in a path: src/ell_set.cpp:175:ellps_ellps:int(PJ*)[INFO ] Start of pass: io.shiftleft.semanticcpg.Overlays$$anon$1[INFO ] Pass io.shiftleft.semanticcpg.Overlays$$anon$1 completed in 1 ms (81% on mutations). 1 + 0 changes committed from 1 parts.[INFO ] writing to storage at `home/schwehr/joern/workspace/PROJ1/cpg.bin.tmp`[INFO ] closed graph at /home/schwehr/joern/workspace/PROJ1/cpg.bin.tmpRun `joern --for-input-path PROJ` to explore interactively
Gemini printed a summary of what it found after a few minutes of work. I asked it to write a summary:
Write the summary as security-scan-summary.md
Which got me:
### 1. Memory Management & Freeing Issues (Severity: High)
A field of a parameter or returned value is `free`'d but not properly reassigned on all execution paths, which could lead to Use-After-Free (UAF) or Double-Free vulnerabilities:
* `src/ell_set.cpp:83`, `85`, `87`, `89` in `pj_ellipsoid:int(PJ*)`
* `src/ell_set.cpp:203` in `ellps_size:int(PJ*)`
* `src/ell_set.cpp:252` in `ellps_shape:int(PJ*)`
* `src/ell_set.cpp:175` in `ellps_ellps:int(PJ*)`
### 2. Format String Vulnerabilities (Severity: High)
A non-constant format string is passed to `printf`, `sprintf`, or `vsprintf`, which can be exploited if the format string contains un-sanitized user input:
* `src/apps/cct.cpp:246` in `main:int(int,char**)`
* `src/apps/gie.cpp:294` in `main<duplicate>2:int(int,char**)`
### 3. Unchecked Read/Recv/Malloc (Severity: Medium)
The return value of memory allocation functions (like `malloc`) or file reading functions does not appear to be checked for `NULL` or error codes before being used:
* `src/apps/cct.cpp:331` in `main:int(int,char**)`
* `src/apps/projinfo_lib.cpp:353` in `buildObject`
* `src/initcache.cpp:50` in `pj_clone_paralist`
* `src/initcache.cpp:148`, `157`, `170` in `pj_insert_initcache`
### 4. Dangerous Functions (`strcpy` / `strncpy`) (Severity: Medium)
Use of unsafe string copy functions that are prone to buffer overflows. These should be replaced with safer alternatives like `strlcpy` or `snprintf`:
* **`src/apps/gie.cpp`**: Lines 637, 756, 757, 762, 774, 1589
* **`src/datum_set.cpp`**: Lines 84, 85
* **`src/filemanager.cpp`**: Lines 1674, 1913
* **`src/grids.cpp`**: Multiple instances (Lines 3946, 3956, 3997, 4032, 4050, 4067, 4083, etc.)
* **`src/init.cpp`**: Lines 159, 194, 200, 244, 271, 289, 555, 644
* **`src/initcache.cpp`**: Lines 55, 172
* **`src/pr_list.cpp`**: Line 86
* **`src/param.cpp`**: Line 23
### 5. String Length Truncation (Severity: Low-Medium)
Assignments involving `strlen()` calls where truncation might occur due to integer type mismatches or unsafe size conversions:
* `src/iso19111/factory.cpp:7959`, `7961`, `7963`
* `src/iso19111/io.cpp:1655` and `6385`
* `src/pr_list.cpp:18` and `79`
* `src/apps/optargpm.h:532`
### 6. File Operation Race Conditions (Severity: Low)
Two file operations on the same path can act on different files (Time-of-Check to Time-of-Use):
* `src/filemanager.cpp:994`, `1002` in `FileManager.open`
* `src/grids.cpp:2698`, `2709`, `2722` in `HorizontalShiftGridSet.open`
* `src/grids.cpp:3974`, `3987` in `proj_grid_info`
* `src/networkfilemanager.cpp`: Multiple lines (e.g. `2541`, `2717`, `2819`) in `proj_download_file` and `proj_is_download_needed`.
I did some quick inspection of the "Memory Management & Freeing Issues" listings and was not impressed. They were all set to nullptr right after being freed. I asked gemini and it basically said the same thing.
Look closer at the issues in src/ell_set.cpp
I then asked Gemini to go through them all:
Go through each of the issues listed in @security-scan-summary.md. For each issue, do an evaluation. For issue that appears to be valid or possibly valid, create a markdown report in the issues directory. The report should include a detailed description of the issue. Explain the risk level of the issue. And give a proposed patch for a fix.
Gemini wrote one issue file:
# Unchecked Memory Allocations in `src/initcache.cpp`## Detailed DescriptionIn `src/initcache.cpp`, there are multiple instances where memory is allocated using `malloc`, but the return value is not gracefully checked for a `NULL` pointer. Instead, the code relies on `assert()` to enforce that the pointer is valid.For example, in `pj_clone_paralist` (line ~50):```cppparalist *newitem =(paralist *)malloc(sizeof(paralist) + strlen(list->param));assert(newitem);newitem->used = 0;```And similarly in `pj_insert_initcache` (lines ~148, 157, 170):```cppcache_key_new = (char **)malloc(sizeof(char *) * cache_alloc);assert(cache_key_new);// ...cache_paralist_new =(paralist **)malloc(sizeof(paralist *) * cache_alloc);assert(cache_paralist_new);// ...cache_key[cache_count] = (char *)malloc(strlen(filekey) + 1);assert(cache_key[cache_count]);```The issue with using `assert()` here is that `assert` statements are typically compiled out in production builds (when `NDEBUG` is defined). In production, if `malloc` fails due to memory exhaustion, the pointers will be `NULL`. The subsequent dereferences (like `newitem->used = 0;`) will cause an immediate segmentation fault, crashing the host application.## Risk Level**Medium**Out-of-memory (OOM) conditions are relatively rare, but robust libraries should handle allocation failures gracefully without crashing the host application. If an attacker can trigger excessive caching, they could theoretically trigger a denial-of-service (DoS) condition by crashing the process.## Proposed PatchReplace `assert()` with a proper `if` condition that handles the `NULL` pointer by cleaning up and returning an error or `nullptr`.```patch--- a/src/initcache.cpp+++ b/src/initcache.cpp@@ -48,7 +48,15 @@for (; list != nullptr; list = list->next) {paralist *newitem =(paralist *)malloc(sizeof(paralist) + strlen(list->param));- assert(newitem);+ if (newitem == nullptr) {+ // Free already copied elements+ while (list_copy != nullptr) {+ paralist *tmp = list_copy->next;+ free(list_copy);+ list_copy = tmp;+ }+ return nullptr;+ }newitem->used = 0;newitem->next = nullptr;@@ -145,14 +153,16 @@cache_alloc = cache_alloc * 2 + 15;cache_key_new = (char **)malloc(sizeof(char *) * cache_alloc);- assert(cache_key_new);+ if (cache_key_new == nullptr) return; // Silent failure on cache insert+if (cache_key && cache_count) {memcpy(cache_key_new, cache_key, sizeof(char *) * cache_count);}free(cache_key);cache_key = cache_key_new;cache_paralist_new =(paralist **)malloc(sizeof(paralist *) * cache_alloc);- assert(cache_paralist_new);+ if (cache_paralist_new == nullptr) return;+if (cache_paralist && cache_count) {memcpy(cache_paralist_new, cache_paralist,sizeof(paralist *) * cache_count);@@ -167,7 +177,8 @@** Duplicate the filekey and paralist, and insert in cache.*/cache_key[cache_count] = (char *)malloc(strlen(filekey) + 1);- assert(cache_key[cache_count]);+ if (cache_key[cache_count] == nullptr) return;+strcpy(cache_key[cache_count], filekey);cache_paralist[cache_count] = pj_clone_paralist(list);```
