Software engineer with 10+ years of experience building, scaling and operating mission-critical services, with a focus on platform engineering. I am a software quality advocate who drives reliability and operational excellence. My work has centered on developing and leading the strategies needed to achieve the organization's reliability vision. I have deep expertise in Java, Python, system design, observability, automation and incident management for high-volume, globally distributed systems.
| Languages | Data | CI/CD | Observability | ML/AI | OS |
|---|---|---|---|---|---|
| Java (Spring) | MSSQL | Docker | Prometheus | Prophet | Linux |
| Python | PostgreSQL | Kubernetes | Grafana | Azure OpenAI | Windows |
| Groovy | Elasticsearch | Ansible | Graylog | Copilot | |
| Kotlin | Kafka | Rundeck | NewRelic | LM Studio | |
| | RabbitMQ | | OpsGenie | spaCy | |
| | Redis | | | | |
Product observability, proactive approach towards reliability, incident response automation, performance testing, prototyping, defining reliability and quality strategy
Developed an AI agent that automates troubleshooting (root cause analysis) by connecting to observability tools and using anomaly detection to identify issues. The agent combines knowledge of the platform with OpenAI models for reasoning. (Azure Data Explorer, Kusto Query Language, Python, Slack API, OpenAI LLMs, ReAct prompting)
Defined and implemented the strategies and roadmaps needed to reach higher reliability objectives. Currently defining and implementing chaos engineering practices and coordinating the implementation of a performance testing tool that is easy and safe to use.
Developed a course covering the entire software development life cycle from a software quality perspective. (Learning outcomes, Bloom's taxonomy)
Implemented end-to-end observability at the product level. Developed and maintained a custom tracing solution and instrumented the SMS product flow. Created product-specific and client-specific dashboard templates for support teams. Defined policies for a structured approach to product monitoring based on synthetic, real-user and front-end monitoring.
Ensured timely delivery of initiatives through careful planning and breakdown. Defined technical learning paths for the SRE team. Led cross-functional teams in short-term and long-term projects. Mentored two SRE colleagues from senior to staff level. Promoted SRE practices across the organization, from developers to C-level executives.
Incident management, Observability, Reliability, Automation
Redefined and improved the entire incident management process at the company level. Implemented metrics and data collection to drive reliability improvements. Responsibilities: platform monitoring, incident response and review, impact assessment, coordination with management, product and support teams, incident and platform reliability reporting, product and service review from a reliability perspective. (Jira, Slack, Confluence)
Created dashboards based on various data sources for efficient monitoring and troubleshooting, and set up actionable alerts and notification policies. Used various sources and tools for troubleshooting and root cause analysis. Defined and implemented Service Level Indicators and Objectives for core products. (Prometheus, Alertmanager, Grafana, OpsGenie, Graylog, Kibana, NewRelic)
Coordinated and participated in company-wide high-risk infrastructure maintenance tasks.
Coordinated the initiative with engineering directors; defined survey questions in close collaboration with the human resources and employer branding departments; analyzed survey results.
Java, Spring, Data pipelines, API Gateways
Developed, refactored and maintained highly available, mission-critical services related to identity management and authentication that handled all authentication and authorization requests for the platform. (Java, Spring framework, MS SQL, Hibernate, Redis)
Set up and maintained several Elasticsearch clusters (up to 40 nodes, ~100 TB of data). Developed and maintained related services for data ingestion and manipulation. (Java, Kafka)
Designed, developed and maintained REST API backends for handling SMS traffic, and HTTP API gateways serving as a platform for the other engineering teams. (Java, Groovy, RabbitMQ, Spring framework, Tomcat, WebFlux, RxJava, Reactor)
Designed and implemented an in-house chatbot solution using NLP, with a focus on developing the intent engine and named entity recognition. (Java, Python, spaCy)
Created and coordinated an escape room for developers.
Researched and introduced the concept of Communities of Practice to the Engineering department as a strategy for improving knowledge sharing across the organization. Organized and led, with the SRM, a community of practice inside the company for revising, promoting and improving SRE practices among developers.
PHP, jQuery, Angular 1.x, MySQL
Optimized SQL queries, introduced a debugger tool to speed up troubleshooting, and improved code quality.
Programmatic realization of the particle swarm optimization algorithm
Ontology-assisted approach for learning causal Bayesian network structure
| Languages | Public speeches | Soft skills |
|---|---|---|
| Croatian | JavaCro (2016, 2018, 2019, 2021, 2022) | Communication and organizational skills |
| English | Infobip DevDays (2016) | Teamwork |
| Italian | Joker conf (2018) | Team culture building |
| French | Meetups (Java Zg, ElasticSearch Zg) | Adaptability |
| | Faculty of Humanities and Social Sciences - University of Zagreb (2023) | Analytical thinking |