
How to install Portworx on OCP4 on VMware? #120

Open
angapov opened this issue Mar 11, 2020 · 8 comments

angapov commented Mar 11, 2020

I am interested in installing Portworx on OCP 4.3 running on VMware vSphere 6.5, with dynamic VMDK provisioning, using the Operator.
I tried https://central.portworx.com/specGen but there is no option to specify a VMware VMDK backend. It gives me errors like these:

@worker-1.angapov.demo.li9.com portworx[1816291]: time="2020-03-11T14:51:22Z" level=info msg="Node HWType: VirtualMachine"
@worker-1.angapov.demo.li9.com portworx[1816291]: time="2020-03-11T14:51:22Z" level=info msg="Initializing node and joining the cluster portworx..."
@worker-1.angapov.demo.li9.com portworx[1816291]: time="2020-03-11T14:51:22Z" level=info msg="Node index overflow (53376)"
@worker-1.angapov.demo.li9.com portworx[1816291]: time="2020-03-11T14:51:22Z" level=warning msg="Failed to initialize Init PX Storage Service: Storage failed initialization"
@worker-1.angapov.demo.li9.com portworx[1816291]: time="2020-03-11T14:51:22Z" level=info msg="Cleanup Init services"
@worker-1.angapov.demo.li9.com portworx[1816291]: time="2020-03-11T14:51:22Z" level=warning msg="Cleanup Init for service Scheduler."
@worker-1.angapov.demo.li9.com portworx[1816291]: time="2020-03-11T14:51:22Z" level=info msg="Cleanup Initializing node and joining the cluster portworx..."
@worker-1.angapov.demo.li9.com portworx[1816291]: time="2020-03-11T14:51:22Z" level=warning msg="Cleanup Init for service PX Storage Service."
@worker-1.angapov.demo.li9.com portworx[1816291]: time="2020-03-11T14:51:22Z" level=info msg="Cleanup Init for Storage provider PXD"
@worker-1.angapov.demo.li9.com portworx[1816291]: time="2020-03-11T14:51:22Z" level=error msg="Failed to initialize node in cluster. Storage failed initialization"
@worker-1.angapov.demo.li9.com portworx[1816291]: time="2020-03-11T14:51:22Z" level=error msg="Cluster Manager Failure on Node[10.0.0.15]: Storage failed initialization"
@worker-1.angapov.demo.li9.com portworx[1816291]: time="2020-03-11T14:51:22Z" level=warning msg="503   Node status not OK (STATUS_ERROR)" Driver="Cluster API" ID=nodeHealth Request="Cluster API"

PX version is 2.3.6, operator version is 1.2 (the default for OpenShift 4.3), installed from OperatorHub.

I tried https://docs.portworx.com/cloud-references/auto-disk-provisioning/vsphere/ but the cluster failed to initialize due to a port 9001 conflict with the OpenShift oauth-proxy.

Are there any instructions on how to do that?

A little bit of background: I currently have three bare-metal hosts running ESXi 6.5 with SSD drives and no shared storage. Each SSD is an independent datastore for its ESXi host. I've installed vanilla OpenShift 4.3 with dynamic VMware PV provisioning.

angapov commented Mar 12, 2020

I've added one disk to each OpenShift worker (total 3 nodes) and recreated the StorageCluster with the following spec: https://pastebin.com/raw/X2RajT8R

Pod status looks like this:

# oc -n kube-system get pod
NAME                                                    READY   STATUS                  RESTARTS   AGE
autopilot-94dc45dbf-g8r95                               1/1     Running                 0          31m
portworx-api-b46gc                                      1/1     Running                 0          31m
portworx-api-f8kk7                                      1/1     Running                 0          31m
portworx-api-mn8rq                                      1/1     Running                 0          31m
portworx-operator-8467647f7f-2w8r7                      1/1     Running                 1          9d
px-cluster-828b1a05-8020-4b73-9e39-6f6be66a8abc-gjtqf   1/2     Running                 0          4m55s
px-cluster-828b1a05-8020-4b73-9e39-6f6be66a8abc-jbkk4   1/2     Running                 0          4m55s
px-cluster-828b1a05-8020-4b73-9e39-6f6be66a8abc-lxrw9   1/2     Running                 0          4m55s
px-csi-ext-7444d9b4fc-g7rk5                             3/3     Running                 0          31m
px-csi-ext-7444d9b4fc-rfjfk                             3/3     Running                 0          31m
px-csi-ext-7444d9b4fc-vz9ps                             3/3     Running                 0          31m
px-lighthouse-68dcd48944-bjxkp                          0/3     Init:CrashLoopBackOff   7          31m
stork-6f8fc7b967-2ffwq                                  1/1     Running                 0          31m
stork-6f8fc7b967-5xw6b                                  1/1     Running                 0          31m
stork-6f8fc7b967-cwhlk                                  1/1     Running                 0          31m
stork-scheduler-6847c58d8d-9vjtz                        1/1     Running                 0          31m
stork-scheduler-6847c58d8d-kfj6z                        1/1     Running                 0          31m
stork-scheduler-6847c58d8d-nfhhw                        1/1     Running                 0          31m

Logs of px-cluster pod: https://pastebin.com/raw/vTGVqGcQ

Something is definitely going wrong. Can you help me?

@sanjaynaikwadi

@angapov - can you share the output of lsblk and blkid from all the worker nodes?
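
If SSH to the nodes is not handy, the same output should be obtainable from the jumphost with oc debug (node names below are placeholders, use whatever oc get nodes reports):

# oc debug node/<worker-node-name> -- chroot /host lsblk
# oc debug node/<worker-node-name> -- chroot /host blkid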

angapov commented Mar 12, 2020

[core@worker-0 ~]$ lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                            8:0    0   120G  0 disk
├─sda1                         8:1    0   384M  0 part /boot
├─sda2                         8:2    0   127M  0 part /boot/efi
├─sda3                         8:3    0     1M  0 part
└─sda4                         8:4    0 119.5G  0 part
  └─coreos-luks-root-nocrypt 253:0    0 119.5G  0 dm   /sysroot
sdb                            8:16   0    50G  0 disk
[core@worker-0 ~]$ blkid
/dev/mapper/coreos-luks-root-nocrypt: LABEL="root" UUID="9599ed34-a678-4e04-9bda-675bc2e8ba7b" TYPE="xfs"

[core@worker-1 ~]$ lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                            8:0    0   120G  0 disk
├─sda1                         8:1    0   384M  0 part /boot
├─sda2                         8:2    0   127M  0 part /boot/efi
├─sda3                         8:3    0     1M  0 part
└─sda4                         8:4    0 119.5G  0 part
  └─coreos-luks-root-nocrypt 253:0    0 119.5G  0 dm   /sysroot
sdb                            8:16   0    50G  0 disk
[core@worker-1 ~]$ blkid
/dev/mapper/coreos-luks-root-nocrypt: LABEL="root" UUID="9599ed34-a678-4e04-9bda-675bc2e8ba7b" TYPE="xfs"

[core@worker-2 ~]$ lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                            8:0    0   120G  0 disk
├─sda1                         8:1    0   384M  0 part /boot
├─sda2                         8:2    0   127M  0 part /boot/efi
├─sda3                         8:3    0     1M  0 part
└─sda4                         8:4    0 119.5G  0 part
  └─coreos-luks-root-nocrypt 253:0    0 119.5G  0 dm   /sysroot
sdb                            8:16   0    50G  0 disk
[core@worker-2 ~]$ blkid
/dev/mapper/coreos-luks-root-nocrypt: LABEL="root" UUID="9599ed34-a678-4e04-9bda-675bc2e8ba7b" TYPE="xfs"

piyush-nimbalkar commented Mar 12, 2020

@angapov It looks like your nodes have restarted enough times that Portworx has run out of node indexes. Can you destroy your Portworx cluster and re-create it? If it fails, can you paste the logs again?

There are instructions in the Portworx operator description on OpenShift about how to cleanly uninstall. Basically, add a deleteStrategy to your StorageCluster and then delete it:

spec:
  deleteStrategy:
    type: UninstallAndWipe
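
Assuming the StorageCluster object lives in kube-system (as in your pod listing), the whole flow would look roughly like this; substitute the actual name reported by the first command:

# oc -n kube-system get storagecluster
# oc -n kube-system patch storagecluster <name> --type merge -p '{"spec":{"deleteStrategy":{"type":"UninstallAndWipe"}}}'
# oc -n kube-system delete storagecluster <name>

UninstallAndWipe also wipes the Portworx metadata from the drives, so the nodes should come back clean when you re-install.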

@piyush-nimbalkar

Also, operator 1.2 automatically uses the 17001-17020 port range for Portworx when running on OpenShift (to avoid the port conflict introduced in OpenShift 4.3).
My guess is that you previously had an older version of the operator, which tried to run on port 9001. In your latest logs, it seems to be using port 17001.
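
If you want to double-check which ports Portworx actually bound on a node, a quick check from the jumphost (assuming ss is present on the RHCOS host, which it normally is; adjust the node name to whatever oc get nodes shows) would be:

# oc debug node/worker-1.angapov.demo.li9.com -- chroot /host ss -ltn | grep 1700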

angapov commented Mar 13, 2020

@piyush-nimbalkar you are right. I added the deleteStrategy, recreated the StorageCluster, and it worked nicely. Thank you very much!

Now the cluster is running on dynamically provisioned VMDK volumes. However, I noticed that I have 3 worker nodes but only 2 drive sets.

[root@jumphost ~]# oc -n kube-system exec px-cluster-5s82k -- /opt/pwx/bin/pxctl clouddrive list
Defaulting container name to portworx.
Use 'oc describe pod/px-cluster-5s82k -n kube-system' to see all of the containers in this pod.
Cloud Drives Summary
	Number of nodes in the cluster:  3
	Number of drive sets in use:  2
	List of storage nodes:  [90cbbeaf-6157-4b52-bf44-7622f6e08c2f 9e17d9e7-c81f-43d7-ac4f-653c9d8b4a71]
	List of storage less nodes:  [7b83b83d-7cbb-486f-812b-07609044c096]

Drive Set List
	NodeIndex	NodeID					InstanceID				Zone	State	Drive IDs
	2		7b83b83d-7cbb-486f-812b-07609044c096	422d8683-19a6-c332-55ef-daa2210fd7d2	default	In Use	-
	0		90cbbeaf-6157-4b52-bf44-7622f6e08c2f	422dc8c7-946f-ab50-7d13-df7059440f84	default	In Use	[datastore-10] osd-provisioned-disks/PX-DO-NOT-DELETE-e5051cb5-d323-4a80-ab97-1e880827ccb0.vmdk(data)
	1		9e17d9e7-c81f-43d7-ac4f-653c9d8b4a71	422d9bcb-22d2-d373-f25d-1495f26cbe50	default	In Use	[datastore-34] osd-provisioned-disks/PX-DO-NOT-DELETE-8881c670-1639-4cad-90df-5c932536823b.vmdk(data)

Do you know how I can add a disk to the storageless node using dynamic VMDK provisioning?

I tried expanding the pool like this, but it gave an error:

[root@jumphost ~]# oc -n kube-system exec px-cluster-5s82k -- /opt/pwx/bin/pxctl service pool expand -s 100 -u 919993f4-1034-47dd-a60a-feafad8c39c6 -o add-disk
Defaulting container name to portworx.
Use 'oc describe pod/px-cluster-5s82k -n kube-system' to see all of the containers in this pod.
Request to expand pool: 919993f4-1034-47dd-a60a-feafad8c39c6 to size: 100 using operation: add-disk
service pool expand: resizing pool with an auto journal device is not supported
command terminated with exit code 1

@sanjaynaikwadi

@angapov - Can you share the logs from this node? We need to see why it was not able to create/attach a disk to it:
2 7b83b83d-7cbb-486f-812b-07609044c096 422d8683-19a6-c332-55ef-daa2210fd7d2
You can get the information from pxctl status (example below).

It looks like you have the journal configured on the data disk; from the previous logs I see -j auto is specified. During installation you can request a 3 GB journal partition, which will be placed on a different disk than your data.
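
To pull the pxctl status output mentioned above, something along these lines from the jumphost should work; the pod name is a placeholder, use whichever px-cluster pod is scheduled on that node:

# oc -n kube-system get pod -o wide | grep px-cluster
# oc -n kube-system exec <px-cluster-pod-on-that-node> -c portworx -- /opt/pwx/bin/pxctl status

Passing -c portworx just avoids the "Defaulting container name" message you saw earlier.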

angapov commented Mar 13, 2020

@sanjaynaikwadi here are the logs: https://pastebin.com/raw/5xPJRFqV
