Dockerfile first commit 2025-06-29 19:16:28 +00:00
entrypoint.sh first commit 2025-06-29 19:16:28 +00:00
parse.py fix for copr 2025-06-29 21:40:15 +00:00
README.md first commit 2025-06-29 19:16:28 +00:00

Datagrepper Account Parser

This process extracts, from each message on the Fedora Infrastructure message bus, the username associated with that event. Some events are not supported at this time (for example, Meetbot does not generate one record per person in the meeting, although it probably should).

Topic Pattern    JSON Path
org.fedoraproject.prod.badges.badge.award% $.user.username
org.fedoraproject.prod.fedbadges% $.user.username
org.fedoraproject.prod.discourse.like% $.webhook_body.like.post.username
org.fedoraproject.prod.discourse.post% $.webhook_body.post.username
org.fedoraproject.prod.discourse.solved% $.webhook_body.solved.username
org.fedoraproject.prod.discourse.topic% $.webhook_body.topic.created_by.username
org.fedoraproject.prod.mailman% $.msg.from
org.fedoraproject.prod.planet% $.username
org.fedoraproject.prod.git% $.commit.username
org.fedoraproject.prod.fas% $.msg.user
org.fedoraproject.prod.openqa% $.user
org.fedoraproject.prod.bodhi.buildroot% $.override.submitter.name
org.fedoraproject.prod.bodhi.update.comment% $.comment.user.name
org.fedoraproject.prod.bodhi% $.update.user.name
org.fedoraproject.prod.bugzilla% $.event.user.login
org.fedoraproject.prod.waiver% $.username
org.fedoraproject.prod.fmn% $.user.name
org.fedoraproject.prod.buildsys% $.owner
org.fedoraproject.prod.copr% $.user
io.pagure.prod.pagure% $.agent
org.fedoraproject.prod.pagure.commit.flag% $.flag.user.name
org.centos.sig.integration.gitlab.redhat.centos-stream% $.user.name
org.fedoraproject.prod.wiki% $.user
org.release-monitoring.prod.anitya.% $.message.agent
org.fedoraproject.prod.maubot.cookie.give.% $.sender
org.fedoraproject.prod.kerneltest.upload.new% $.agent
org.fedoraproject.prod.fedocal% $.agent
org.centos.prod.buildsys% $.owner
org.fedoraproject.prod.badges.person.rank.advance% $.person.nickname
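The table above can be read as a lookup from topic pattern to the location of the username in the message body. The following is a minimal sketch of that idea and not the code in parse.py: the names TOPIC_PATHS, resolve_path and extract_username are hypothetical, the trailing % is assumed to act as a prefix wildcard, and the first matching pattern is assumed to win (which is why the more specific bodhi patterns are listed before the general one).

```python
from typing import Any, Optional

# A few (topic pattern, JSON path) pairs copied from the table above.
# Assumption: order matters, so more specific patterns come first.
TOPIC_PATHS = [
    ("org.fedoraproject.prod.bodhi.buildroot%", "$.override.submitter.name"),
    ("org.fedoraproject.prod.bodhi.update.comment%", "$.comment.user.name"),
    ("org.fedoraproject.prod.bodhi%", "$.update.user.name"),
    ("org.fedoraproject.prod.copr%", "$.user"),
]


def resolve_path(body: Any, path: str) -> Optional[Any]:
    """Follow a simple dotted path such as '$.user.name' through nested dicts."""
    current = body
    for key in path.removeprefix("$.").split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def extract_username(topic: str, body: dict) -> Optional[Any]:
    """Return the value at the path of the first pattern that matches the topic."""
    for pattern, path in TOPIC_PATHS:
        # Assumption: a trailing '%' means "any suffix", so compare prefixes.
        if topic.startswith(pattern.rstrip("%")):
            return resolve_path(body, path)
    return None
```

With these assumptions, a topic such as org.fedoraproject.prod.bodhi.update.comment resolves through $.comment.user.name, while any other bodhi topic falls through to $.update.user.name.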

Only messages with non-null headers and body are processed. The extracted usernames are cleaned up to remove any extra characters, such as quotes. Any rows without a valid username are discarded.
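The exact cleanup rules are not spelled out here, so the following is only a sketch of what such a step might look like. The function name clean_username is hypothetical, and the rule it applies (trim whitespace and surrounding quote characters, then treat an empty result as invalid) is an assumption rather than the behaviour of parse.py.

```python
from typing import Optional


def clean_username(raw: object) -> Optional[str]:
    """Trim whitespace and surrounding quotes; return None if nothing usable is left."""
    if not isinstance(raw, str):
        # Assumption: non-string values (None, dicts, numbers) are not valid usernames.
        return None
    cleaned = raw.strip().strip("\"'").strip()
    return cleaned or None
```

Rows for which this returns None would then be the ones that are discarded.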

The result is Parquet files containing only the essential fields:

  • sent_at timestamp
  • id of the message
  • topic of the event
  • username as the parsed username for the message

Output files are saved in the output_users directory that you map into the container, with filenames of the form fedora-{YYYYMMDD}_processed.parquet.
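As an illustration of the naming pattern, the sketch below builds the path for one day's output file. The helper name output_path is hypothetical, and it assumes that the {YYYYMMDD} part is a calendar date with one file per day, which the pattern suggests but this README does not state explicitly.

```python
from datetime import date
from pathlib import Path


def output_path(output_dir: str, day: date) -> Path:
    """Build a path following the fedora-{YYYYMMDD}_processed.parquet pattern."""
    return Path(output_dir) / f"fedora-{day:%Y%m%d}_processed.parquet"
```

For example, output_path("/data/output_users", date(2025, 6, 29)) yields a file named fedora-20250629_processed.parquet inside the mapped output directory.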

Usage

Build the container

docker build -t datagrepper-parse-accounts .

Run the container

docker run --rm \
  -e INPUT_DIR=/data/input \
  -e OUTPUT_DIR=/data/output_users \
  -v ~/data/fedora/datagrepper-raw:/data/input:ro \
  -v ~/data/fedora/datagrepper-users:/data/output_users \
  datagrepper-parse-accounts

Processed Parquet files will be saved to:

~/data/fedora/datagrepper-users

License

This project is licensed under the GNU General Public License v3.0.

Copyright © 2025 Robert Wright

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see https://www.gnu.org/licenses/.