Commit d734724: "Skill for parser creator" (1 parent c8a27d4)

---
name: create-sc4s-parser
description: DEPRECATED, do not use. Creates SC4S syslog-ng parsers using a TDD workflow. Use when the user wants to create a new parser, add support for a new log source or vendor, or says "create parser", "add parser", "new log source", or "new vendor support".
---

# SC4S Parser Creation (TDD Workflow)

Create syslog-ng parsers for SC4S using test-driven development: write the test first, then the parser, then validate against the full test suite.

## Prerequisites

Before starting, gather from the user:

1. **vendor** — vendor name (lowercase, e.g. `acme`)
2. **product** — product name (lowercase, e.g. `firewall`)
3. **sourcetype** — Splunk sourcetype (e.g. `acme:firewall`)
4. **index** — target Splunk index (e.g. `netfw`, `netops`, `netauth`, `netproxy`, `epintel`)
5. **sample_logs** — one or more raw syslog messages
6. **parser_type** (optional) — `syslog` (default), `almost-syslog`, `cef`, `netsource`

If any are missing, ask the user. Use the AskQuestion tool for structured input when available.

## Workflow Checklist

Copy this and track progress:

```
Parser: {vendor}_{product}
- [ ] Step 1: Analyze sample logs
- [ ] Step 2: Review existing parsers for conflicts
- [ ] Step 3: Create parser .conf for lite package and main package
- [ ] Step 4: Create addon_metadata.yaml (if new vendor)
- [ ] Step 5: Create unit test
- [ ] Step 6: Run the parser test
- [ ] Step 7: Run regression tests
```

---

## Step 1: Analyze Sample Logs

Examine the raw sample logs to determine:

1. **Syslog format**:
   - RFC3164: `<PRI>TIMESTAMP HOSTNAME PROGRAM: MESSAGE`
   - RFC5424: `<PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID SDATA MESSAGE`
   - CEF: `<PRI>TIMESTAMP HOSTNAME CEF:0|<Device Vendor>|<Device Product>|<Device Version>|<Signature ID>|<Name>|<Severity>|<Extension fields>`

2. **Identifying features** (what makes this log unique):
   - PROGRAM field (e.g. `swlogd`, `CISE_`, `%ASA-`)
   - Message content patterns (e.g. `devid=FG`, `1,TIMESTAMP,SERIAL,TRAFFIC,`)
   - Structured data fields

3. **Filter strategy** — choose the narrowest filter:
   - If the log is in CEF format: use the `<Device Vendor>` and `<Device Product>` fields
   - If the log has a unique PROGRAM: use a `program()` filter with the `sc4s-syslog-pgm` topic
   - If PROGRAM is empty but the message has a unique prefix/pattern: use a `message()` filter with the `sc4s-syslog` topic
   - If identification requires a dedicated port: use the `sc4s-network-source` topic

4. **Template selection**:
   - `t_hdr_msg` — most common, includes MSGHDR + MESSAGE
   - `t_msg_only` — message only (no header), used for CSV formats like Palo Alto
   - `t_5424_hdr_sdata_compact` — RFC5424 with structured data
   - `t_hdr_sdata_msg` — header + SDATA + message (Juniper)
   - `t_kv_values` — key-value formatted output
   - `t_json_values` — JSON-formatted extracted values

For full template and parser pattern reference, see [parser-reference.md](parser-reference.md).

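The format identification in item 1 above can be sketched as a small helper (illustrative only, not part of SC4S; the regexes cover the common shapes, not the full RFCs):

```python
import re

def classify_syslog(line: str) -> str:
    """Rough format classification for a raw syslog sample.
    Illustrative helper; real-world samples may need closer inspection."""
    # Strip the <PRI> prefix if present.
    m = re.match(r"<\d{1,3}>", line)
    rest = line[m.end():] if m else line
    # CEF messages carry a literal "CEF:0|" header.
    if "CEF:0|" in rest:
        return "cef"
    # RFC5424 starts with the version digit "1" and an ISO timestamp.
    if re.match(r"1 \d{4}-\d{2}-\d{2}T", rest):
        return "rfc5424"
    # RFC3164 starts with a BSD timestamp like "Feb 18 09:37:41".
    if re.match(r"[A-Z][a-z]{2} {1,2}\d{1,2} \d{2}:\d{2}:\d{2} ", rest):
        return "rfc3164"
    return "unknown"
```

Running it against a sample tells you which branch of the filter-strategy decision applies.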
## Step 2: Review Existing Parsers for Conflicts

Before writing the filter, check that no existing parser uses a filter that would match these logs:

1. Search for similar `program()` filters in `package/lite/etc/addons/`:
   ```
   grep -r "program(" package/lite/etc/addons/ --include="*.conf"
   ```

2. Search for similar `message()` patterns:
   ```
   grep -r "message(" package/lite/etc/addons/ --include="*.conf"
   ```

3. If there IS a conflict, narrow your filter or discuss with the user.

## Step 3: Create Parser Configuration

For the main package, create `package/etc/conf.d/conflib/{parser_type}/app-{parser_type}-{vendor}_{product}.conf`.

For the lite package, create `package/lite/etc/addons/{vendor}/app-{parser_type}-{vendor}_{product}.conf`.

**Parser file template (simple case with PROGRAM filter):**

```
block parser app-{parser_type}-{vendor}_{product}() {
    channel {
        rewrite {
            r_set_splunk_dest_default(
                index('{INDEX}')
                sourcetype('{SOURCETYPE}')
                vendor("{VENDOR}")
                product("{PRODUCT}")
                template('{TEMPLATE}')
            );
        };
    };
};
application app-{parser_type}-{vendor}_{product}[sc4s-syslog-pgm] {
    filter {
        program('{PROGRAM_MATCH}' type(string) flags(prefix));
    };
    parser { app-{parser_type}-{vendor}_{product}(); };
};
```

**Parser file template (message-based filter, no PROGRAM):**

```
block parser app-{parser_type}-{vendor}_{product}() {
    channel {
        rewrite {
            r_set_splunk_dest_default(
                index('{INDEX}')
                sourcetype('{SOURCETYPE}')
                vendor("{VENDOR}")
                product("{PRODUCT}")
                template('{TEMPLATE}')
            );
        };
    };
};
application app-{parser_type}-{vendor}_{product}[sc4s-syslog] {
    filter {
        "${PROGRAM}" eq ""
        and message('{MESSAGE_PATTERN}');
    };
    parser { app-{parser_type}-{vendor}_{product}(); };
};
```

**Important:**
- Use the `sc4s-syslog-pgm` topic when matching on the PROGRAM field.
- Use the `sc4s-syslog` topic when matching on message content with an empty PROGRAM.
- Use the `sc4s-network-source` topic for port-based identification.
- Use the `cef` topic for CEF logs.
- For complex parsers with field extraction, see [parser-reference.md](parser-reference.md).

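As a filled-in illustration of the simple-case template, here is what a parser for a hypothetical vendor `acme` with product `firewall` might look like (the program name, index, and sourcetype are invented for this sketch):

```
block parser app-syslog-acme_firewall() {
    channel {
        rewrite {
            r_set_splunk_dest_default(
                index('netfw')
                sourcetype('acme:firewall')
                vendor("acme")
                product("firewall")
                template('t_hdr_msg')
            );
        };
    };
};
application app-syslog-acme_firewall[sc4s-syslog-pgm] {
    filter {
        # Matches PROGRAM values beginning with "acmefw" (invented name)
        program('acmefw' type(string) flags(prefix));
    };
    parser { app-syslog-acme_firewall(); };
};
```
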
## Step 4: Create addon_metadata.yaml (if new vendor)

If the vendor directory doesn't exist yet, create `package/lite/etc/addons/{vendor}/addon_metadata.yaml`:

```yaml
---
name: "{vendor}"
```

## Step 5: Create Unit Test

Create `tests/test_{vendor}_{product}.py` following the test pattern.
For the full test template, see [test-reference.md](test-reference.md).

**Key rules for test creation:**

1. **Templatize** the raw sample logs:
   - Replace the PRI value with `{{ mark }}`
   - Replace the BSD timestamp with `{{ bsd }}`
   - Replace the hostname with `{{ host }}`
   - Keep the message body as-is from the sample

2. **Imports** — always use:
   ```python
   from jinja2 import Environment, select_autoescape
   from .sendmessage import sendsingle
   from .splunkutils import splunk_single
   from .timeutils import time_operations
   import datetime
   import pytest
   import shortuuid
   ```

3. **Test marker** — use `@pytest.mark.addons("{vendor_addon_dir}")` where `{vendor_addon_dir}` matches the addon directory name under `package/lite/etc/addons/`.

4. **Search query** must include: `index`, `_time={{ epoch }}`, `sourcetype`, and `host`.

5. **Assert** `result_count == 1` for each test event.

6. **Multiple test functions** — create one test per log variant (different subtypes, formats, etc.)

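The templatizing rules in item 1 can be sketched as a hypothetical helper for RFC3164-style samples (illustrative only, not part of the SC4S test harness):

```python
import re

def templatize_rfc3164(raw: str) -> str:
    """Turn a raw RFC3164 sample into a Jinja template per the rules above."""
    # Replace the <PRI> value with {{ mark }}.
    out = re.sub(r"^<\d{1,3}>", "{{ mark }}", raw)
    # Replace the first BSD timestamp (e.g. "Feb 18 09:37:41") with {{ bsd }}.
    out = re.sub(r"[A-Z][a-z]{2} {1,2}\d{1,2} \d{2}:\d{2}:\d{2}", "{{ bsd }}", out, count=1)
    # Replace the token after the timestamp (the hostname) with {{ host }}.
    out = re.sub(r"(\{\{ bsd \}\} )\S+", r"\1{{ host }}", out, count=1)
    return out
```

The message body after the hostname is left untouched, per rule 1's last bullet.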
**Test file template (simple case):**

```python
# Copyright 2019 Splunk, Inc.
#
# Use of this source code is governed by a BSD-2-clause-style
# license that can be found in the LICENSE-BSD2 file or at
# https://opensource.org/licenses/BSD-2-Clause
import shortuuid

from jinja2 import Environment, select_autoescape
from .sendmessage import sendsingle
from .splunkutils import splunk_single
from .timeutils import time_operations
import datetime
import pytest

env = Environment(autoescape=select_autoescape(default_for_string=False))


# Paste the original raw log as a comment above testdata
# <134>Feb 18 09:37:41 myhost myprogram: some message content
testdata = [
    "{{ mark }}{{ bsd }} {{ host }} myprogram: some message content",
]


@pytest.mark.addons("{VENDOR_ADDON_DIR}")
@pytest.mark.parametrize("event", testdata)
def test_{vendor}_{product}(
    record_property, get_host_key, setup_splunk, setup_sc4s, event
):
    host = get_host_key

    dt = datetime.datetime.now()
    _, bsd, _, _, _, _, epoch = time_operations(dt)

    # Tune time functions
    epoch = epoch[:-7]

    mt = env.from_string(event + "\n")
    message = mt.render(mark="<134>", bsd=bsd, host=host)

    sendsingle(message, setup_sc4s[0], setup_sc4s[1][514])

    st = env.from_string(
        'search index={INDEX} _time={{ epoch }} sourcetype="{SOURCETYPE}" (host="{{ host }}" OR "{{ host }}")'
    )
    search = st.render(epoch=epoch, host=host)

    result_count, _ = splunk_single(setup_splunk, search)

    record_property("host", host)
    record_property("resultCount", result_count)
    record_property("message", message)

    assert result_count == 1
```

Replace `{VENDOR_ADDON_DIR}`, `{INDEX}`, `{SOURCETYPE}`, `{vendor}`, `{product}` with actual values.

## Step 6: Run the Parser Test

Run only the new test to verify the parser catches the sample logs:

```bash
poetry run pytest tests/test_{vendor}_{product}.py -v --tb=long \
  -k "test_{vendor}_{product}" \
  --splunk_type=external \
  --sc4s_host=<SC4S_HOST> \
  --splunk_host=<SPLUNK_HOST>
```

**If the test fails:**
1. Read the error output carefully.
2. Common issues:
   - Filter too narrow → broaden the `program()` or `message()` match
   - Wrong application topic → switch between `sc4s-syslog-pgm` and `sc4s-syslog`
   - Wrong template → try `t_hdr_msg`, `t_msg_only`, or `t_legacy_hdr_msg`
   - Timestamp parsing issues → check if a custom `date-parser` is needed
3. Fix the parser and re-run until the test passes.

## Step 7: Run Regression Tests

Run the full test suite to ensure no existing parsers are broken:

```bash
poetry run pytest tests/ -v --tb=long -n 14 \
  -k "not lite and not name_cache" \
  --splunk_type=external \
  --sc4s_host=<SC4S_HOST> \
  --splunk_host=<SPLUNK_HOST>
```

**If an existing test fails:**
- Your new parser's filter is too broad — it's matching logs from another vendor
- Tighten the filter condition (add more specificity to the `program()` or `message()` match)
- Re-run until all tests pass

---

## Decision Guide: Application Topic

| Scenario | Topic | Example |
|----------|-------|---------|
| Log has a unique PROGRAM field | `sc4s-syslog-pgm` | `program('swlogd' type(string) flags(prefix))` |
| PROGRAM is empty, message has pattern | `sc4s-syslog` | `message('^1,[^,]+,[^,]+,[A-Z]+\,')` |
| Must use a dedicated port | `sc4s-network-source` | Port-based identification |
| Log is RFC5424 with SDATA | `sc4s-syslog` | `match('\[vendor@' value("SDATA"))` |
| Log is malformed / non-standard | `sc4s-almost-syslog` | Timestamp or header issues |

## Decision Guide: Filter Type

| Filter | When to use | Syntax |
|--------|-------------|--------|
| `program()` | PROGRAM field present and unique | `program('name' type(string) flags(prefix))` |
| `message()` | Match on message content | `message('pattern' type(string) flags(prefix\|substring))` or regex |
| `match()` | Match on a specific field | `match('value' value('.field.name'))` |
| Combined | Multiple conditions needed | `program('X') or message('Y')` |

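For the "Combined" row in the table above, conditions can be joined with `and`/`or`; a sketch with invented program and message values:

```
application app-syslog-acme_firewall[sc4s-syslog-pgm] {
    filter {
        # Either a matching PROGRAM or a distinctive message prefix (both invented)
        program('acmefw' type(string) flags(prefix))
        or message('ACMEFW:' type(string) flags(prefix));
    };
    parser { app-syslog-acme_firewall(); };
};
```
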
## Additional Resources

- For parser patterns with field extraction (CSV, KV, regex): [parser-reference.md](parser-reference.md)
- For advanced test patterns (RFC5424, framed, multi-variant): [test-reference.md](test-reference.md)

---

# CEF (Common Event Format)

A CEF-formatted syslog message typically looks like this:
```
<PRI>TIMESTAMP HOSTNAME CEF:0|<Device Vendor>|<Device Product>|<Device Version>|<Signature ID>|<Name>|<Severity>|<Extension fields>
```

**Fields:**
- `CEF:0` — Literal, indicating CEF version 0
- `<Device Vendor>` — Name of the vendor, e.g., `Guardicore`
- `<Device Product>` — Product name, e.g., `Centra`
- `<Device Version>` — Product version, e.g., `51`
- `<Signature ID>` — Unique event identifier, e.g., `Network Log`
- `<Name>` — Short description/name for the event, e.g., `Network Log`
- `<Severity>` — Event severity, string or number, e.g., `None`
- `<Extension fields>` — Key-value pairs (space-separated), e.g., `id=157d593c act=Allowed src=10.1.2.3 ...`

**Sample Raw CEF Message:**
```
<14>2023-02-20T22:01:00Z myhost CEF:0|Guardicore|Centra|51|Network Log|Network Log|None|id=157d593c act=Allowed src=10.1.1.1 dst=10.2.2.2 proto=TCP cs1Label=connection_type cs1=SUCCESSFUL
```

- `<14>` — PRI value (syslog priority)
- `2023-02-20T22:01:00Z` — RFC3339 timestamp
- `myhost` — Hostname
- `CEF:0|...` — CEF header and fields

**Common extension fields include:**
- `id` (event ID), `act` (action), `cnt` (count), `src` (source address), `dst` (destination address), `dpt` (destination port), etc.
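To make the header/extension split concrete, here is a minimal parsing sketch in Python (illustrative only; it ignores the CEF escaping rules that real parsers must handle):

```python
import re

def parse_cef(msg: str) -> dict:
    """Split a CEF message into its 7 header fields and extension key-values.
    Toy sketch: does not handle escaped pipes or equals signs."""
    start = msg.index("CEF:")
    # Header: seven pipe-delimited fields after "CEF:", then the extension blob.
    parts = msg[start:].split("|", 7)
    header = dict(zip(
        ["version", "device_vendor", "device_product", "device_version",
         "signature_id", "name", "severity"],
        parts[:7],
    ))
    header["version"] = header["version"].split(":", 1)[1]  # "CEF:0" -> "0"
    # Extensions: key=value pairs; a value runs until the next " key=" or end.
    ext = {}
    blob = parts[7] if len(parts) > 7 else ""
    for m in re.finditer(r"(\w+)=(.*?)(?= \w+=|$)", blob):
        ext[m.group(1)] = m.group(2)
    return {"header": header, "extensions": ext}
```

Applied to the sample message above, this yields `device_vendor` = `Guardicore` and `device_product` = `Centra`, the two fields the filter in the application block matches on.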

For parser development, see [parser-reference.md](parser-reference.md) for appropriate filters and templates. CEF logs usually require a `[cef]` topic in the application block, and filtering based on `.metadata.cef.device_vendor` and `.metadata.cef.device_product`.

Example minimal application block for Guardicore Centra:
```
application app-cef-guardicore_centra[cef] {
    filter {
        match("Guardicore" value(".metadata.cef.device_vendor"))
        and match("Centra" value(".metadata.cef.device_product"));
    };
    parser { app-cef-guardicore_centra(); };
};
```

See also: [CEF reference documentation](https://community.microfocus.com/cyberres/pdf/cef.pdf) for full field definitions and best practices.
