
Pod-Level Sizing

Cluster sizing recommends the number of pods of each type required for different queries-per-second (QPS) rates.

Pod or Container Sizing

Pod sizing covers two aspects: the resources required by a pod, and the number of pods required for a given QPS and a given number of OAuth refresh token requests.

The sizing guidelines are generic; actual resource requirements vary with factors such as the average payload size per request and response, the size of the configuration, and the number of OAuth requests (token refreshes and creations).

Pod Characteristics

| Pod Type | Memory Usage | CPU Usage | Storage Usage | Network |
|---|---|---|---|---|
| cache | High (for content caching) | High | Low | High (depending on traffic) |
| configui | Normal | Normal | Low | Low |
| loader (CacheLoader in v6.0.0; Loader from v6.1.0+) | Low | Low | Low | Low |
| platformapi | Low | Low | Low | Low |
| trafficmanager (TM) | High (a lot of G1GC young-generation GC activity) | High | Low | High (depending on traffic) |
| logcollector StatefulSet (in v6.2.0) | Low | Normal | Normal | High |

Resource Limits and Requests

Requests are the initial allocation of resources, and limits define the maximum memory or CPU a pod can use.

You can define limits for CPU (CPU time) and memory. When defining limits, a general recommendation is to set the request to half of the limit. A general rule of thumb is to allow 20% headroom for memory and CPU and to ensure that the application is not paging.
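
As an illustration of this guidance, the following container spec sets requests to half of the limits; the container name, image, and values are hypothetical examples, not product defaults:

```yaml
# Illustrative only: names, image, and values are hypothetical, not product defaults.
apiVersion: v1
kind: Pod
metadata:
  name: sizing-example
spec:
  containers:
    - name: trafficmanager
      image: registry.example.com/trafficmanager:latest   # placeholder image
      resources:
        requests:
          memory: "2.5Gi"   # half of the memory limit
          cpu: "600m"       # half of the CPU limit
        limits:
          memory: "5Gi"     # heap + non-heap overhead + ~20% headroom
          cpu: "1200m"
```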

By defining resource limits, you gain the following benefits:

  • Prevents one pod from consuming so many resources that other pods are starved of them; a starved pod is restarted.

  • Prevents memory leaks in the application from draining nodes of memory.

  • Optimizes the use of resources instead of over-provisioning.

  • Allows for automatic horizontal scaling of the cache and traffic manager pods (see the sketch after this list).
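
As a minimal sketch of CPU-based horizontal scaling (the target Deployment name, replica counts, and threshold below are assumptions, not product defaults; CPU requests must be set for the utilization metric to work):

```yaml
# Minimal HPA sketch; the target Deployment name and the thresholds are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: trafficmanager-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: trafficmanager        # assumed workload name
  minReplicas: 3
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75   # scale out when average CPU exceeds 75% of requests
```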

For more information about units of resources, refer to the Kubernetes documentation on resource units for Pods and containers.

Kubernetes Pod Sizing

caution
  1. The data captured below is for indicative recommendations only.
  2. You can choose to scale vertically by providing higher CPU and memory requests and limits.

Each component pod runs a Fluent Bit sidecar container. The CPU and memory utilization figures below include the Fluent Bit container.

Idle State Utilization Observed

| Service | No. of Pods | Expected Memory Utilization per Pod | Expected CPU Utilization per Pod |
|---|---|---|---|
| TM | 3 | 4.7 GB | 15m/0.015 CPU core |
| cache | 3 | 1.2 GB | 25m/0.025 CPU core |
| loader (CacheLoader in v6.0.0; Loader from v6.1.0+) | 1 | 1.2 GB | 30m/0.03 CPU core |
| configUI | 1 | 600 MB | 20m/0.02 CPU core |
| platformapi | 1 | 1.4 GB | 30m/0.03 CPU core |
| logcollector (in v6.2.0) | 1 | 500 MB | 11m/0.011 CPU core |

Memory Sizing

Every pod hosts a Fluent Bit container. Memory utilization is usually driven by the Java Virtual Machine (JVM) maximum heap size, to which an overhead for non-heap usage (OS-native Java binaries and libraries) and the memory footprint of the Fluent Bit container are added.

All components (except configui) execute in a JVM.

For the medium (multi-zone) cluster:

| Service | Base Memory Usage per Pod | Memory Overhead per Pod | Memory per Pod Including Margin |
|---|---|---|---|
| TM | 4 GB | 0.7 GB | 4.9 GB |
| cache | 1 GB | 300 MB | 1.5 GB |
| loader (CacheLoader in v6.0.0; Loader from v6.1.0+) | 1 GB | 200 MB | 1.3 GB |
| configUI | 600 MB | 150 MB | 750 MB |
| platformapi | 2 GB | 400 MB | 2.5 GB |
| logcollector (in v6.2.0) | 500 MB | 300 MB | 800 MB |
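
As a sketch of how these figures translate into container settings, the TM row (4 GB heap + 0.7 GB overhead ≈ 4.9 GB) could map to the fragment below; passing the heap via a JAVA_OPTS-style environment variable is an assumption, and the product may configure the heap differently (for example, through chart values):

```yaml
# Sketch (pod template fragment): aligning a TM container's memory settings with the table above.
# The JAVA_OPTS variable is an assumption; the heap may be set differently.
containers:
  - name: trafficmanager
    env:
      - name: JAVA_OPTS
        value: "-Xmx4g"        # 4 GB base heap from the table
    resources:
      requests:
        memory: "2.5Gi"        # roughly half of the limit
      limits:
        memory: "5Gi"          # 4 GB heap + 0.7 GB overhead ≈ 4.9 GB, rounded up
```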

Sizing for the components that are not directly involved in traffic handling

The resource allocation for the loader (CacheLoader in v6.0.0; Loader from v6.1.0+), configUI, and platformapi components does not depend on the volume of traffic handled by the traffic manager component.

| Service | Number of Pods | Memory Utilization per Pod | CPU Utilization per Pod |
|---|---|---|---|
| loader (CacheLoader in v6.0.0; Loader from v6.1.0+) | 1 | 1.2 GB | 1 CPU core |
| configUI | 1 | 600 MB | 1 CPU core |
| platformapi | 1 | 1.4 GB | 1 CPU core |
| logcollector | 1 | 800 MB | 1.2 CPU core |

Sizing for the multi-zone cluster topology

500 QPS

| Service | Recommended No. of Pods | Min. No. of Pods | Recommended No. of Nodes/Zone | Min. No. of Nodes/Zone | Observed CPU Utilization/Pod | CPU Limit Inclusive of 25% Margin |
|---|---|---|---|---|---|---|
| TM | 6 | 3 | 3 | 2 | 900m/0.9 CPU core | 1125m/1.2 CPU cores |
| cache | 3 | 3 | 3 | 2 | 500m/0.5 CPU core | 625m/0.63 CPU core |

1500 QPS

| Service | Recommended No. of Pods | Min. No. of Pods | Recommended No. of Nodes/Zone | Min. No. of Nodes/Zone | Observed CPU Utilization/Pod | CPU Limit Inclusive of 25% Margin |
|---|---|---|---|---|---|---|
| TM | 6 | 3 | 3 | 2 | 2500m/2.5 CPU cores | 3125m/3.2 CPU cores |
| cache | 3 | 3 | 3 | 2 | 750m/0.75 CPU core | 950m/1 CPU core |

3000 QPS

| Service | Recommended No. of Pods | Min. No. of Pods | Recommended No. of Nodes/Zone | Min. No. of Nodes/Zone | Observed CPU Utilization/Pod | CPU Limit Inclusive of 25% Margin |
|---|---|---|---|---|---|---|
| TM | 6 | 3 | 3 | 2 | 5300m/5.3 CPU cores | 6650m/6.7 CPU cores |
| cache | 3 | 3 | 3 | 2 | 1000m/1 CPU core | 1250m/1.3 CPU cores |
| logcollector (in v6.2.0) | | | | | | |

Fault Tolerance Considerations

For the recommended configuration (2 TM + 1 cache pod per zone, each on a distinct node), the deployment affords:

  • tolerance for the loss of up to 3 TM pods or 3 TM-hosting nodes

  • tolerance for the loss of 2 cache-hosting nodes

  • tolerance for the loss of all the pods from one zone plus up to 1 TM and 2 cache pods (or their hosting nodes)

For the minimal configuration (1 TM + 1 cache pod per zone, each on a distinct node), the deployment affords:

  • degraded-mode operation in the event of the loss of a TM-hosting node or of an entire zone (the surviving 2 pods may not have enough CPU to handle all the traffic)

  • tolerance for the loss of up to 2 cache-hosting nodes

Sizing for the small single zone cluster topology

500 QPS

| Service | Recommended Number of Pods | Minimum Number of Pods | Observed CPU Utilization per Pod | CPU Limit Inclusive of 25% Margin |
|---|---|---|---|---|
| TM | 3 | 2 | 1050m/1.05 CPU cores | 1125m/1.2 CPU cores |
| cache | 3 | 2 | 270m/0.27 CPU core | 340m/0.4 CPU cores |
caution

Pod performance depends on factors such as the network and, for the Fluent Bit containers, storage speed. Each TM pod can handle a limited number of concurrent calls, configured via the jetty_pool_maxthreads parameter (the default value set in the trafficmanager's values.yaml configuration file is 512). Having more small TM pods (in terms of CPU allocation) rather than fewer larger TM pods allows better handling of situations where back ends are slow to respond. Peak CPU utilization also depends on JVM garbage collection (GC) activity, not just on the traffic volume.

The requirements shown here for different QPS levels are based on actual observations. Horizontal Pod Autoscaling (HPA) was not in effect during the load testing.
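
If the TM concurrency limit mentioned in the caution above needs to be changed, it would be done through the chart values; the key path shown below is a hypothetical example and should be checked against the actual trafficmanager values.yaml:

```yaml
# Hypothetical values override; verify the real key path in the
# trafficmanager chart's values.yaml before using.
trafficmanager:
  jetty_pool_maxthreads: 1024   # default is 512
```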

Pod affinities vis-à-vis placement on nodes

When you do not specify resource requests and limits, the resource usage patterns of each pod determine how workloads are scheduled across nodes. For high availability (HA) deployments, it is important to ensure that pods of the same type are not placed on the same node. Critical components like the Traffic Manager, Logcollector (introduced in 6.2.0), and Cache should also be distributed across nodes. To avoid a single point of failure, anti-affinity rules should be used to prevent these critical pods from being scheduled on the same node.
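
A minimal sketch of such an anti-affinity rule, assuming the TM pods carry an app: trafficmanager label (the label key and value are assumptions; use the labels your deployment actually applies):

```yaml
# Sketch (pod spec fragment): a required anti-affinity rule that keeps TM pods on distinct nodes.
# The app: trafficmanager label is an assumption.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: trafficmanager
        topologyKey: kubernetes.io/hostname   # at most one matching pod per node
```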

Pod/Node Affinity

[Pod/node affinity matrix covering: Platform API (in v6.0.0+), loader (cacheLoader in v6.0.0; loader from v6.1.0+), configUI (in v6.0.0+), cache (in v6.0.0+), TM (in v6.0.0+), and logcollector (in v6.2.0)]

Networking Considerations

  • Local Edition is compatible with all CNCF-certified CNIs.

  • Make sure the pod network is initialized with a unique CIDR range (see the sketch after this list).

  • Services should be deployed with a unique service IP for pod-to-pod communication.
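
How the pod and service CIDRs are set depends on how the cluster is provisioned. As one hedged example, a kubeadm ClusterConfiguration could declare them as follows (the subnets shown are placeholders; managed clusters set these differently):

```yaml
# Example kubeadm ClusterConfiguration snippet; the subnets are placeholders.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
  podSubnet: 10.244.0.0/16      # unique pod network CIDR
  serviceSubnet: 10.96.0.0/12   # unique service CIDR for service-based pod-to-pod communication
```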

Storage Considerations

In LE v6.2.0

In Local Edition 6.2.0, all components are stateless and do not use persistent storage, except for the logcollector, which is deployed as a StatefulSet and uses persistent storage.

Log data collected by the logcollector component requires 1 GB of persistent storage by default. To ensure fast and reliable log storage performance, it is recommended to use a storage class with high write throughput, such as premium-rwo (GCP), gp2 (AWS), Premium_LRS (Azure), or ocs-storagecluster-ceph-rbd (OpenShift).
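
As an illustration of such a claim (the claim name and storage class are placeholders; the actual template is defined by the product charts):

```yaml
# Illustrative volumeClaimTemplates entry for the logcollector StatefulSet.
# The claim name and storage class are placeholders.
volumeClaimTemplates:
  - metadata:
      name: logcollector-data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: premium-rwo   # GCP example; use gp2, Premium_LRS, or ocs-storagecluster-ceph-rbd as appropriate
      resources:
        requests:
          storage: 1Gi                # 1 GB default log storage
```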

In LE v6.0.0 and v6.1.0

The Local Edition 6.0.0 and 6.1.0 components are stateless and do not use persistent volumes.

The database accommodates the following non-API definition objects:

  • Daily or monthly counters.

  • OAuth tokens: A million tokens would require 1.1 GB (table + index + binlog).

  • Configuration data: Requires 2 GB of storage.

It is important to set the console log size and the number of log archives so that enough logs can be stored without rollover (and loss of logs) before the Fluent Bit container is ready to publish the logs to the observability stack.

You can configure two kubelet settings, containerLogMaxSize and containerLogMaxFiles, in the kubelet configuration file. They define the maximum size of each log file and the maximum number of log files allowed per container, respectively.
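
For example, in the kubelet configuration file (the values shown are illustrative, not recommendations):

```yaml
# Illustrative kubelet configuration; the values are examples, not recommendations.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 50Mi   # maximum size of each container log file before rotation
containerLogMaxFiles: 5     # maximum number of rotated log files kept per container
```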

Refer to Kubernetes Logging Architecture for more information.
