Deconstructing the Monolith: A Pragmatic Migration to Microservices
Scaling an EdTech platform across East and West Africa introduces unique engineering constraints. Network connectivity is often intermittent, and the majority of our rural users interact with our platform via USSD (Unstructured Supplementary Service Data) and SMS over GSM networks, rather than rich web applications.
When our user base scaled past the five-million mark, our core PHP monolith hit an architectural wall. The same codebase was managing synchronous USSD sessions, heavy background SMS broadcasts, billing ledgers, and student dashboards.
This is an architectural deep dive into how we strangled the monolith, migrating critical USSD and transactional subsystems to Elixir and Node.js while maintaining data consistency across services.
The Latency Wall: Synchronous Blocking
USSD queries require responses within a strict 5-10 second window dictated by mobile network operators (MNOs). In our legacy monolith, the USSD gateway would send an HTTP POST request to our PHP-FPM endpoint.
Each request held a PHP process open while it performed synchronous database reads, wrote session state, and called third-party billing APIs. If Safaricom’s or MTN’s payment gateway stalled for even three seconds, requests piled up faster than workers were released and the PHP-FPM worker pool was exhausted within moments. Because every endpoint shared that pool, the failure cascaded: student API traffic, website routing, and subsequent USSD requests were all dropped.
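The exhaustion described above follows from Little's Law: the average number of busy workers equals the arrival rate multiplied by the time each request holds a worker. The sketch below works through that arithmetic. The pool size, request rate, and baseline latency are hypothetical figures chosen for illustration, not measurements from our platform; only the three-second upstream delay comes from the scenario above.

```python
import math

def workers_needed(requests_per_second: int, latency_ms: int) -> int:
    """Little's Law: busy workers = arrival rate x time each request holds a worker."""
    return math.ceil(requests_per_second * latency_ms / 1000)

# Hypothetical figures for illustration only.
POOL_SIZE = 150        # assumed PHP-FPM pm.max_children
ARRIVAL_RATE = 200     # assumed USSD requests per second
BASE_LATENCY_MS = 400  # assumed time a healthy request holds a worker
UPSTREAM_DELAY_MS = 3000  # the three-second payment gateway stall

normal = workers_needed(ARRIVAL_RATE, BASE_LATENCY_MS)
degraded = workers_needed(ARRIVAL_RATE, BASE_LATENCY_MS + UPSTREAM_DELAY_MS)

print(f"normal:   {normal} busy workers (pool of {POOL_SIZE})")
print(f"degraded: {degraded} busy workers (pool of {POOL_SIZE})")
print("pool exhausted" if degraded > POOL_SIZE else "pool holds")
```

Under these assumed numbers the pool needs 80 workers when healthy and 680 during the stall, so a pool of 150 saturates even though traffic itself has not increased.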
Solution 1: Elixir/BEAM for USSD Session State Machines
USSD sessions are stateful and interactive. A user dials a shortcode (e.g., *384#), navigates a menu, inputs answers, and receives scores.
To solve the concurrency bottleneck, we carved out USSD session routing and rebuilt it in Elixir. The Erlang Virtual Machine (BEAM) is built for massive concurrency, allowing us to spin up an isolated, lightweight process for every active mobile session. A freshly spawned process starts at only ~2.6KB of memory and grows with the state it holds, so tens of thousands of concurrent sessions fit comfortably on a single node.
Here is a simplified GenServer behavior representing how we managed USSD session states:
defmodule Eneza.UssdSession do
  use GenServer, restart: :transient

  # Inactivity timeout: the session stops if no input arrives within 20 seconds.
  @session_timeout 20_000

  # Client API

  def start_link(session_id, msisdn) do
    GenServer.start_link(__MODULE__, {msisdn, :initial_menu}, name: via_tuple(session_id))
  end

  def process_input(session_id, input) do
    GenServer.call(via_tuple(session_id), {:process_input, input})
  end

  # Server Callbacks

  @impl true
  def init({msisdn, state}) do
    {:ok, %{msisdn: msisdn, menu_state: state, history: []}, @session_timeout}
  end

  # The timeout has to be returned from every callback. If a reply omits it,
  # the timer is cleared after the first message and the session never expires.
  @impl true
  def handle_call({:process_input, "*384#"}, _from, socket_state) do
    response = "CON Welcome to Eneza. Choose option:\n1. Start Quiz\n2. View Score"
    {:reply, response, %{socket_state | menu_state: :main_menu}, @session_timeout}
  end

  def handle_call({:process_input, "1"}, _from, %{menu_state: :main_menu} = socket_state) do
    response = "CON Select subject:\n1. Math\n2. Science"
    new_state = %{socket_state | menu_state: :select_subject, history: ["main_menu" | socket_state.history]}
    {:reply, response, new_state, @session_timeout}
  end

  @impl true
  def handle_info(:timeout, socket_state) do
    # No input within the window: stop normally so a :transient child is not restarted.
    {:stop, :normal, socket_state}
  end

  # Sessions are registered by ID, so callers never need to track PIDs.
  defp via_tuple(session_id), do: {:via, Registry, {Eneza.SessionRegistry, session_id}}
end
With this architecture, response times dropped from 400ms to under 15ms. If a session crashed due to a malformed payload, its supervisor restarted only that session process. The failure stayed isolated and did not affect the other 50,000 active sessions.
Solution 2: Strangler Fig Gateway and Session Routing
We did not split the database or rewrite the codebase overnight. We utilized the Strangler Fig Pattern, introducing Nginx reverse proxy routing at the edge:
- /ussd traffic was immediately routed to the new Elixir service.
- /api/v2 endpoints for student dashboard APIs were routed to a Node.js microservice optimized for rapid Redis read-caching.
- /admin and /billing remained on the legacy PHP monolith.
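The routing split above amounts to a small prefix table. The sketch below models that decision in Python so the rules can be read and tested in isolation; it is not our Nginx configuration. The upstream names are hypothetical labels, the fallback to the monolith for unlisted paths is an assumption typical of Strangler Fig setups, and matching on whole path segments is a simplification of how Nginx prefix locations behave.

```python
# Route table mirroring the edge rules described above.
# Upstream names are hypothetical labels, not real hostnames.
ROUTES = [
    ("/ussd", "elixir_ussd"),
    ("/api/v2", "node_dashboard_api"),
    ("/admin", "php_monolith"),
    ("/billing", "php_monolith"),
]

# Assumption: anything not yet migrated keeps going to the monolith.
DEFAULT_UPSTREAM = "php_monolith"

def pick_upstream(path: str) -> str:
    """Return the upstream for the longest route prefix that matches `path`."""
    best_prefix, best_upstream = "", DEFAULT_UPSTREAM
    for prefix, upstream in ROUTES:
        # Match on whole path segments so "/ussdx" does not match "/ussd".
        matches = path == prefix or path.startswith(prefix + "/")
        if matches and len(prefix) > len(best_prefix):
            best_prefix, best_upstream = prefix, upstream
    return best_upstream

for example in ("/ussd/session", "/api/v2/students/4928", "/billing/invoices", "/api/v1/legacy"):
    print(example, "->", pick_upstream(example))
```

Migrating another subsystem then means adding one row to the table, which is what makes the pattern incremental.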
The primary challenge was managing shared session state. The PHP monolith stored sessions in server-local files. To support stateless microservices, we:
- Migrated all session storage to a shared, high-availability Redis cluster.
- Implemented a lightweight Gateway Service that intercepts each request, exchanges the legacy session cookie for a short-lived, cryptographically signed JWT containing the user's scopes and claims, and forwards that token downstream.
This allowed the Elixir and Node.js microservices to remain stateless and scale independently without querying the legacy database for user authentication.
Solution 3: Distributed Idempotency and Transactional Outbox
In a distributed system, network calls fail. If the USSD service (Elixir) finishes a student quiz and sends a QuizCompleted event to RabbitMQ/Kafka, network retries might result in the event being published twice. If the Billing microservice consumes this duplicate event, the student is double-charged.
To guarantee data consistency without distributed locking, we implemented the Transactional Outbox Pattern alongside Consumer Idempotency:
1. Transactional Outbox
Instead of publishing directly to the broker inside our business logic, we insert an event record into a database outbox table within the same atomic ACID transaction as the business state update:
BEGIN;
UPDATE student_progress SET quiz_status = 'completed' WHERE student_id = 4928;
INSERT INTO outbox (id, event_type, payload, status)
VALUES ('evt_9281a', 'QuizCompleted', '{"student_id": 4928}', 'pending');
COMMIT;
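A separate relay process (Debezium/Kafka Connect or a lightweight polling daemon) reads the outbox table and publishes each pending event to the message broker. Because the relay only advances past an event after the broker acknowledges it, a crash mid-publish leads to a retry rather than a lost event. That is what makes delivery at-least-once, and it is also why consumers must tolerate duplicates.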
2. Consumer Idempotency Keys
On the consumer side, the billing and analytics engines track handled message IDs in an idempotency_keys table with a unique database constraint on (key, service), where key holds the message ID and service names the consumer:
-- On Consumer Node:
BEGIN;
INSERT INTO idempotency_keys (key, service) VALUES ('evt_9281a', 'billing_service');
-- If the unique constraint is violated, the INSERT fails: the consumer rolls back,
-- acknowledges the message as a duplicate, and the UPDATE below never runs
UPDATE accounts SET balance = balance - 5 WHERE student_id = 4928;
COMMIT;
Together, the two patterns gave us eventual consistency and kept duplicate deliveries from corrupting the financial ledger, even when network partitions triggered bursts of retries.
Conclusion
A pragmatic microservices migration is about choosing your battles. We didn’t decouple the system because microservices are fashionable; we decoupled it because we had different resource demands:
- Elixir handled stateful, highly concurrent USSD sessions.
- Node.js served high-throughput Redis-backed content APIs.
- PHP remained for complex, low-concurrency admin tooling and billing ledgers.
By pairing the right runtime with the right scaling problem, we scaled Eneza to five million active learners without sacrificing billing integrity or platform reliability.