Why HA Matters for On-Premise Bitrix24
On-premise Bitrix24 deployments shift all redundancy responsibility to your team, and for a 100-300 user portal, the cost of unplanned downtime on a single server almost always exceeds the investment in a clustered HA architecture that eliminates hardware, OS, and database single points of failure.
Cloud editions of Bitrix24 handle redundancy automatically. The moment you move to the self-hosted (on-premise) box edition, typically to satisfy data-sovereignty requirements, internal security policies, or compliance obligations, that responsibility shifts entirely to your team. A single-server setup with 47 GB RAM and 16 CPU cores can handle a medium-sized portal comfortably, but it provides zero protection against hardware failure, OS-level crashes, or database corruption.
In practice, the cost of unplanned downtime for a 100-300 user portal almost always exceeds the cost of the additional infrastructure required to prevent it. Clustered deployments also make maintenance windows shorter: you can roll updates through nodes one at a time without taking the portal offline. For context on what on-premise deployment involves end-to-end, see Self-Hosted CRM: Why Choose Bitrix24 On-Premise for Data Sovereignty.
Core Components of a Bitrix24 HA Architecture
A production-grade Bitrix24 HA setup requires five fully redundant layers: a load balancer, two or more stateless web nodes, shared file storage, a MySQL master-replica database cluster, and a shared cache/session bus. Leaving any single layer without redundancy defeats the entire architecture.
A production-grade HA setup for self-hosted Bitrix24 typically consists of five layers:
| Layer | Component | Role |
|---|---|---|
| Load balancer | nginx / HAProxy | Distributes HTTP/S traffic; detects dead nodes |
| Web nodes (×2 or more) | Apache + PHP-FPM | Stateless application tier |
| Shared file storage | NFS / GlusterFS / S3-compatible | Single source of truth for /home/bitrix/www/upload and other user content |
| Database cluster | MySQL/Percona: master + replica(s) | Persistent data with automatic or manual failover |
| Cache / session bus | Memcached or Redis | Shared session storage so users don't lose sessions on node switchover |
All layers must be redundant: hardening the web tier while leaving a standalone database defeats the purpose.
Web Environment Requirements and Version Pinning
All cluster nodes must run the identical Bitrix VM template, with CentOS Stream 9, nginx 1.26+, Apache 2.4.62+, PHP 8.2+, and Percona Server 8.0+; mixed environments between nodes cause subtle, hard-to-diagnose bugs and outdated environments block critical updates.
The Bitrix web environment is versioned and opinionated. Based on project audit data, running on an outdated environment (e.g., web environment 7.5.1 when 9.x is current) introduces instability risks and blocks certain updates. The official Bitrix virtual machine image bundles:
- OS: CentOS Stream 9 (CentOS 7 is end-of-life and carries unpatched CVEs)
- Reverse proxy: nginx 1.26+
- Application server: Apache 2.4.62+
- PHP: 8.2+ (PHP-FPM mode; display_errors and display_startup_errors must be off in production)
- Database: Percona Server 8.0+ (MySQL 5.7 is EOL)
When building cluster nodes, always deploy each node from the same Bitrix VM template version. Mixed environments (one node on old PHP, another on new) cause subtle, hard-to-diagnose bugs.
Key MySQL tuning points for clustered deployments:
- Disable query_cache (removed in MySQL 8, but still present in some Percona 5.7 configs, where it causes cache-invalidation races across nodes)
- Disable local_infile (security hardening)
- Increase innodb_log_file_size beyond the default 64 MB: a medium portal with a ~28 GB database benefits from values of 512 MB to 2 GB to reduce checkpoint pressure
- Tune join_buffer_size and sort_buffer_size according to workload profiling, not defaults
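As a rough illustration of the tuning points above, the following Python sketch flags settings that contradict them. It assumes the server variables have already been fetched (for example from SHOW GLOBAL VARIABLES) into a plain dict of strings; the function name audit_mysql_variables and the 512 MB floor are illustrative choices, not part of any Bitrix or Percona tooling.

```python
"""Flag MySQL variables that contradict the tuning points listed above."""

# Assumption: lower bound of the 512 MB to 2 GB range suggested in the text.
MIN_REDO_LOG_BYTES = 512 * 1024 * 1024


def audit_mysql_variables(variables: dict) -> list:
    """Return human-readable findings; an empty list means nothing was flagged.

    `variables` maps variable names to string values, as returned by
    SHOW GLOBAL VARIABLES. Fetching them is left to the caller.
    """
    findings = []

    # query_cache_type is absent on 8.0 servers, where the cache was removed.
    if variables.get("query_cache_type", "OFF").upper() not in ("OFF", "0"):
        findings.append("query_cache is enabled; disable it")

    if variables.get("local_infile", "OFF").upper() not in ("OFF", "0"):
        findings.append("local_infile is enabled; disable it")

    raw_size = variables.get("innodb_log_file_size")
    if raw_size is not None and int(raw_size) < MIN_REDO_LOG_BYTES:
        size_mb = int(raw_size) // (1024 * 1024)
        findings.append(
            f"innodb_log_file_size is {size_mb} MB; consider 512 MB or more"
        )
    return findings
```

Running a check like this against every database node after provisioning is one way to catch configuration drift between the master and its replicas.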
MySQL Master-Replica Replication Setup
Configure Bitrix24's .settings.php with separate default (master) and slave (read-only replica) connection profiles, and enforce GTID-based semi-synchronous replication with ROW binary log format and sync_binlog = 1 to guarantee durability and enable automatic failover via Orchestrator or ProxySQL.
Bitrix24's database connection is defined in /home/bitrix/www/bitrix/.settings.php (the connections block). In a clustered setup you configure two connection profiles:
// Primary connection: all writes go to the master
'default' => [
    'className' => '\Bitrix\Main\DB\MysqliConnection',
    'host' => 'db-master.internal',
    'database' => 'sitemanager',
    'login' => 'bitrix',
    'password' => '<strong-password>',
    'options' => 2,
],
// Read-only connection: points at the replica
'slave' => [
    'className' => '\Bitrix\Main\DB\MysqliConnection',
    'host' => 'db-replica.internal',
    'database' => 'sitemanager',
    'login' => 'bitrix',
    'password' => '<strong-password>',
    'options' => 2,
    'readonly' => true,
],
Bitrix24 uses the slave connection for read queries automatically when it is present. Key replication checklist:
- GTID-based replication: easier failover promotion than file/position-based
- Semi-synchronous replication: at least one replica acknowledges each commit before the master returns success; prevents data loss on master crash
- Binary log format: ROW, required for Bitrix compatibility
- innodb_flush_log_at_trx_commit = 1 and sync_binlog = 1 on master: critical for durability
- Replica lag monitoring: alert if Seconds_Behind_Master (reported as Seconds_Behind_Source by SHOW REPLICA STATUS in MySQL 8.0.22+) exceeds 30 seconds; Bitrix's session and cache logic assume near-real-time replication
For automatic failover, a tool like Orchestrator can promote a replica to master, and a proxy layer such as ProxySQL can redirect application connections to the new master, so the connection string does not have to be edited by hand.
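The lag-monitoring item in the checklist above can be sketched as a small Python check. This is a minimal example under stated assumptions: the replica status row has already been fetched into a dict by whatever client you use, and the function name replica_alerts is hypothetical. The 30-second threshold comes from the checklist.

```python
LAG_ALERT_SECONDS = 30  # threshold from the replication checklist


def replica_alerts(status: dict) -> list:
    """Evaluate one row of SHOW REPLICA STATUS (or SHOW SLAVE STATUS).

    `status` is that row as a dict; fetching it is left to the caller.
    Returns a list of alert strings, empty when the replica looks healthy.
    """
    alerts = []
    # MySQL 8.0.22+ renamed these fields, so accept either spelling.
    lag = status.get("Seconds_Behind_Source", status.get("Seconds_Behind_Master"))
    io_running = status.get("Replica_IO_Running", status.get("Slave_IO_Running"))
    sql_running = status.get("Replica_SQL_Running", status.get("Slave_SQL_Running"))

    if io_running != "Yes" or sql_running != "Yes":
        alerts.append("replication threads are not both running")
    if lag is None:
        # A NULL lag means the replica cannot determine how far behind it is.
        alerts.append("replication lag is unknown (NULL)")
    elif int(lag) > LAG_ALERT_SECONDS:
        alerts.append(f"replication lag {lag}s exceeds {LAG_ALERT_SECONDS}s")
    return alerts
```

Treating a NULL lag as an alert matters because a stopped replication thread reports NULL rather than a large number, so a plain "greater than 30" comparison would miss it.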
Shared Storage for Uploaded Files
The /home/bitrix/www/upload directory must be identical across all web nodes in real time; NFS v4 is the pragmatic default for deployments up to ~500 users, but file-based cache must be replaced with Memcached or Redis before adding a second web node.
The /home/bitrix/www/upload directory (and a few others, including cache if you use file-based caching) must be identical across all web nodes in real time. Options ranked by operational complexity:
| Solution | Latency | Complexity | Suitable for |
|---|---|---|---|
| NFS v4 over a dedicated storage node | Low | Low | Most deployments up to ~500 users |
| GlusterFS replicated volume | Medium | Medium | Geo-distributed nodes |
| Ceph / RBD block device | Very low | High | Large-scale / enterprise |
| S3-compatible object storage + custom adapter | Low (with CDN) | Medium | Cloud-hybrid setups |
NFS remains the pragmatic default for on-premise mid-size deployments. Mount it with noatime,nodiratime to reduce write amplification, and monitor mount availability with a watchdog: a stale NFS mount that hangs rather than errors is a classic cause of web-node freezes.
The cache directory deserves special attention: file-based cache does not scale across nodes. Switch to Memcached or Redis as the cache backend (/bitrix/.settings.php, cache section) before you add the second web node.
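A mount watchdog of the kind mentioned above has one hard requirement: it must not hang together with the mount it is checking. The Python sketch below shows one way to do that, by running the file-system call in a daemon thread and giving up after a timeout. The function name mount_responds and the 5-second default are assumptions for illustration.

```python
import os
import threading


def mount_responds(path: str, timeout_s: float = 5.0) -> bool:
    """Return True if os.stat(path) completes successfully within timeout_s.

    The stat call runs in a daemon thread because a stale NFS mount can
    block it indefinitely; the watchdog itself must never block with it.
    """
    result = {"ok": False}

    def probe() -> None:
        try:
            os.stat(path)
            result["ok"] = True
        except OSError:
            result["ok"] = False  # missing or erroring path: a clean failure

    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout_s)
    # A thread that is still alive after the timeout indicates a hung call,
    # which is the stale-mount case.
    return result["ok"] and not worker.is_alive()
```

A cron job or systemd timer could call mount_responds("/home/bitrix/www/upload") on each web node and, on failure, raise an alert or take the node out of rotation. What happens on failure is a policy decision this sketch leaves open.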
Load Balancer and Session Stickiness
The load balancer must use least_conn distribution with max_fails=3 fail_timeout=15s health checks, while sessions must be moved from the default filesystem handler to Redis with persistence enabled so users on any node retain their sessions across switchovers.
The load balancer sits in front of all web nodes and performs two jobs: traffic distribution and health checking.
nginx upstream example (simplified):
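upstream bitrix_nodes {
    # Send each request to the node with the fewest active connections
    least_conn;
    # Skip a node for 15 s after 3 failed attempts within 15 s
    server web1.internal:80 max_fails=3 fail_timeout=15s;
    server web2.internal:80 max_fails=3 fail_timeout=15s;
    # Idle upstream connections kept open per worker process
    keepalive 32;
}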
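To make the selection logic concrete, here is a deliberately simplified Python model of least_conn combined with max_fails and fail_timeout. It is not nginx's implementation: it only approximates passive health checking by counting failures inside a sliding window, and the class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Upstream:
    name: str
    max_fails: int = 3
    fail_timeout: float = 15.0
    active: int = 0  # in-flight connections
    failures: list = field(default_factory=list)  # timestamps of recent failures

    def available(self, now: float) -> bool:
        """Skip the server once max_fails failures fall within fail_timeout."""
        self.failures = [t for t in self.failures if now - t < self.fail_timeout]
        return len(self.failures) < self.max_fails


def pick_least_conn(servers: list, now: float):
    """Return the available server with the fewest active connections, or None."""
    candidates = [s for s in servers if s.available(now)]
    return min(candidates, key=lambda s: s.active, default=None)
```

In this model a node with three recent failures is skipped even if it has the fewest connections, and it becomes eligible again once those failures age out of the window. That is the behaviour the failover tests later in this article are meant to confirm on the real load balancer.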
Session handling: Bitrix24 stores sessions on the file system by default. In a cluster you must move sessions to shared storage:
- Memcached: fastest, but sessions are lost on memcached restart
- Redis (with persistence enabled): recommended; survives restarts
Configure in /etc/php.d/session.ini (or the PHP-FPM pool config):
session.save_handler = redis
session.save_path = "tcp://redis.internal:6379"
Also configure Push & Pull (the real-time notification server) on each web node to point to the same Redis pub/sub channel; otherwise users on different nodes miss chat messages and live feed updates.
HA Architecture Diagram
The canonical Bitrix24 HA topology routes HTTPS traffic through an nginx/HAProxy load balancer to two stateless Apache+PHP-FPM web nodes sharing NFS storage and a Redis session/cache bus, all writing to a Percona 8.0 MySQL master that replicates via GTID to a hot standby replica.
The diagram below shows how incoming HTTPS traffic flows through a load balancer to two stateless web nodes, which share a file storage backend and a Redis session/cache bus, with a MySQL master-replica pair handling all persistence.
flowchart TD
CLIENT[Users / Browsers] --> LB[Load Balancer\nnginx / HAProxy\nSSL Termination]
LB --> WEB1[Web Node 1\nApache + PHP-FPM]
LB --> WEB2[Web Node 2\nApache + PHP-FPM]
WEB1 --> NFS[Shared Storage\nNFS / GlusterFS\nupload, static files]
WEB2 --> NFS
WEB1 --> REDIS[Redis\nSessions + Cache\n+ Push&Pull]
WEB2 --> REDIS
WEB1 --> DBMASTER[MySQL Master\nPercona 8.0]
WEB2 --> DBMASTER
DBMASTER -- GTID replication --> DBREPLICA[MySQL Replica\nPercona 8.0]
DBMASTER -.failover.-> DBREPLICA
Security Hardening Across the Cluster
Every additional web node expands the attack surface, so security controls (including IP allowlists for the admin panel, HSTS headers, PHP error display disabled, 2FA for all admin accounts, and a shared IP blocklist synced via ipset) must be applied consistently at the load-balancer level across the entire cluster.
Adding more nodes increases the attack surface proportionally. A security audit of a typical single-node installation surfaces findings that become amplified in a cluster, so every node must be hardened consistently. Core checklist:
- Admin panel access restricted by IP allowlist: apply at the load balancer level (not just per-node) so it is enforced even if nginx on one node is misconfigured
- HSTS header + HTTP-to-HTTPS redirect: configure on the load balancer, not on individual nodes, to avoid inconsistency
- PHP error display disabled: display_errors = Off and display_startup_errors = Off in every node's php.ini; leaking stack traces to the browser exposes file paths, SQL structure, and config details
- Two-factor authentication (2FA) for all admin accounts: a cluster-level config in Bitrix admin panel applies to all nodes automatically
- Proactive Protection module: ensure the minimum security level is above the platform default; disable frame embedding unless required
- Firewall with IP blocklist: maintain a shared blocklist (e.g., via ipset synced across nodes) and update it regularly
- Web antivirus: Bitrix's built-in web antivirus adds some CPU overhead; evaluate the performance impact on each node before enabling in production
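Consistency across nodes is easier to enforce when it is checked by a script rather than by eye. As one example, the php.ini item in the checklist above could be verified with a sketch like the following. The parser is intentionally minimal (it ignores sections and handles only ; comments), and the function name php_ini_violations is an assumption for illustration.

```python
REQUIRED_OFF = ("display_errors", "display_startup_errors")
OFF_VALUES = {"off", "0", "false", "no", "none"}


def php_ini_violations(ini_text: str) -> list:
    """Return the directives from REQUIRED_OFF that are missing or not off.

    A missing directive counts as a violation here, on the assumption that
    an unset value falls back to PHP's built-in default rather than to Off.
    """
    seen = {}
    for raw in ini_text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop ; comments
        if "=" not in line:
            continue  # section headers such as [PHP], blank lines
        key, value = (part.strip() for part in line.split("=", 1))
        seen[key] = value.strip('"').lower()  # later lines override earlier ones
    return [name for name in REQUIRED_OFF if seen.get(name) not in OFF_VALUES]
```

Note that a PHP-FPM pool file can override php.ini (for instance through php_admin_value), so a complete check would also need to inspect the pool configuration on each node.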
For update procedures and testing practices, the pattern from real projects is: always test on a staging replica first, back up database + files + configs, then roll out node by node. This is doubly important in a cluster because a failed update on one node can cause split-brain behaviour if session or cache schemas diverge. See Bitrix24 Implementation Cost & Timeline: Real Data from 1,300+ Projects for realistic planning figures if you are sizing the HA project budget.
Update and Maintenance Procedures in a Clustered Environment
Rolling updates in a Bitrix24 cluster require a full backup, staging validation covering CRM, tasks, integrations, and Push & Pull, followed by node-by-node rollout with the updated node drained at the load balancer first, keeping a pre-update database snapshot available for at least 48 hours post-deployment.
Rolling updates are one of the primary operational benefits of an HA cluster. The recommended procedure from real project plans:
- Full backup of the master database, all file storage, and all configuration files before any change
- Staging validation: spin up a copy of the entire cluster (or at minimum a single-node replica of production), apply the update, and run functional tests:
  - CRM entity creation and editing
  - Task workflows and business processes
  - 1C or ERP integrations (sync scripts, POST-based data exchange)
  - Chat and Push & Pull notifications
- Code migration: customisations should live in /local/ rather than inside /bitrix/ module directories; this is the officially supported approach and prevents core updates from overwriting your changes. Migrate any files in /bitrix/php_interface/ to /local/ before updating
- Node-by-node rollout: drain node 1 at the load balancer (set to down in upstream config), update it, smoke-test, then repeat for node 2
- Post-update DB checks: run Bitrix's built-in database structure check and resolve any reported errors (typically a handful of auto-fixable schema mismatches)
- Rollback plan: keep the pre-update database snapshot readily accessible for at least 48 hours post-deployment
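The node-by-node rollout step can be expressed as a short control loop. The Python sketch below is illustrative only: the four callables (drain, update, smoke_test, restore) are placeholders for whatever your environment uses to edit the upstream config, apply the Bitrix update, and probe the node, and none of them corresponds to a real Bitrix or nginx API.

```python
from typing import Callable, Iterable


def rolling_update(
    nodes: Iterable[str],
    drain: Callable[[str], None],
    update: Callable[[str], None],
    smoke_test: Callable[[str], bool],
    restore: Callable[[str], None],
) -> list:
    """Update nodes one at a time and stop at the first failed smoke test.

    Returns the nodes that were updated and returned to rotation. A node
    that fails its smoke test stays drained so it cannot serve traffic.
    """
    completed = []
    for node in nodes:
        drain(node)  # e.g. mark the server `down` in the upstream block
        update(node)
        if not smoke_test(node):
            break  # leave the node drained and halt the rollout
        restore(node)  # put the node back into rotation
        completed.append(node)
    return completed
```

Halting on the first failure keeps at least one node on the previous version, which is what makes the rollback plan in the list above practical.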
If your team is also considering migrating from a cloud CRM to a self-hosted setup, the same staged approach applies. See How to Migrate from HubSpot to Bitrix24: Step-by-Step Plan for a migration framework that transfers smoothly to an on-premise cluster target.
Monitoring and Failover Testing
A Bitrix24 HA cluster must be validated with scheduled failure tests: monthly web-node kills (pass: failover within 15 s, no session loss), quarterly MySQL master kills (pass: reconnection within 60 s), and bi-annual full DR restores, all instrumented via Prometheus and Grafana tracking replication lag, Redis hit rate, and upstream health.
A cluster that has never been tested under failure conditions is not an HA cluster; it is a more expensive single-server setup. Build these checks into your runbook:
| Test | Frequency | Pass criterion |
|---|---|---|
| Kill web node 1 | Monthly | Traffic shifts to web node 2 within the configured fail_timeout (15 s in the upstream example; nginx's own default is 10 s); no user session loss |
| Kill MySQL master | Quarterly | Replica promoted; application reconnects within 60 s |
| NFS mount failure simulation | Quarterly | Application returns a clean error, does not hang web workers |
| Full DR restore from backup | Bi-annually | Portal operational from backup in under defined RTO |
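Pass criteria expressed in seconds are easiest to judge from recorded probe results rather than from a stopwatch. The following Python sketch assumes you already log periodic health probes as (timestamp, succeeded) pairs during a failure test; the function names are hypothetical.

```python
def longest_outage(probes: list) -> float:
    """Return the longest continuous outage in seconds.

    `probes` is a time-ordered list of (timestamp_seconds, succeeded) pairs.
    An outage runs from the first failed probe to the next successful one.
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in probes:
        if not ok and outage_start is None:
            outage_start = timestamp
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    if outage_start is not None:
        # Still failing at the end: count up to the last observation.
        longest = max(longest, probes[-1][0] - outage_start)
    return longest


def failover_passed(probes: list, limit_s: float) -> bool:
    """True when no outage in the probe log exceeded limit_s seconds."""
    return longest_outage(probes) <= limit_s
```

For the web-node test the limit would be 15, and for the MySQL master test 60. The measured outage can only be as precise as the probe interval, so probe at least every second or two during a test.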
Instrument your cluster with at minimum: node-level resource metrics (CPU, memory, disk I/O), MySQL replication lag, Memcached/Redis hit rate, and load balancer upstream health, ideally surfaced in a single dashboard (Prometheus + Grafana is a common stack with the Bitrix environment).