使用队列系统 –陈浩南的文档和博客

Documentation

使用队列系统

队列系统的作用

简而言之：排队。你告诉队列系统使用什么资源（例如多少核的 CPU）运行某个程序，轮到你的任务执行时，队列系统会按照你的要求执行任务。队列系统会考虑服务器的负载能力（不可以同时运行太多任务把服务器挤垮）、有多人使用服务器时公平分配资源，以及记录任务执行过程中的开销。

队列系统不与某个特定的计算软件（例如 VASP）绑定。也就是说，你可以脱离队列系统直接运行 VASP，也可以把别的软件也放到队列系统里运行；它们虽然经常在一起使用，但本质上是两个分离的概念。¹

使用队列系统

不同的服务器使用不同的队列系统。

厦大超算（jykang）：使用 LSF。
srv1 和 srv2：使用 Slurm。
srv1（Windows）：没有队列系统，但请阅读 Windows 章节，了解相关注意事项。

LSF

IBM Spectrum LSF，适用于厦大超算（jykang）

Slurm

Slurm，适用于 srv1 srv2

Windows

Windows 没有通用的免费的队列系统程序，通常手动控制任务依次执行。 ~~我曾经在2024年暑假答应鹏岗帮他写一个，但事实是我把他鸽了，有空一定写，咕咕咕。~~

你可以使用 FDTD Solutions 自带的一个单用户的队列系统。在 FDTD Solutions 内的控制台使用 addjob 命令将要模拟的文件加入队列，然后它就会依次运行。文档见这里。

⚠️

不要同时启动过多的计算任务。
例如，如果一个任务需要一个小时跑完，那么按顺序运行两个任务需要两个小时，但是同时运行两个任务需要的时间往往远大于两个小时。若同时运行 10 个任务，可能连系统都没有反应了。

一个简易的队列系统的实现

下面是我做 FDTD 计算时写的一个小玩意儿，通过判断当前目录（包括子目录）下面的各个 fsp 是否存在对应的 xxx_p0.log 文件来判断每个 fsp 是否已经被模拟过，如果没有就依次模拟。每一次模拟完成后都会再扫描一遍整个目录路，因此可以动态地把需要模拟的 fsp 在多台服务器之间转移（负载均衡）。代码兼容 Linux 和 Windows，模拟完一个 fsp 之后还会给我发个消息（这样我可以知道进度）。你可以根据自己的需求修改使用，或者用 Python 之类你更熟悉的语言重写。

# define FMT_HEADER_ONLY
#include <fmt/format.h>
#include <fmt/ranges.h>
#include <filesystem>
#include <iostream>
#include <cstdlib>
#include <regex>
#include <string>
#include <thread>
#include <set>
#include <fstream>
# define CPPHTTPLIB_OPENSSL_SUPPORT
#include <httplib.h>

using namespace std::literals;

int main()
{
  while (true)
  {
    std::set<std::string> fsp, log;
    std::string job;
    // std::string hostname;
    for (const auto& entry : std::filesystem::recursive_directory_iterator("."))
    {
      std::string file = entry.path().string();
      if (std::regex_match(file, std::regex(R"(^(.*)\.fsp$)")))
        fsp.insert(file);
      else if (std::regex_match(file, std::regex(R"(^(.*)_p0\.log$)")))
        log.insert(file);
    }
    for (const auto& f : fsp)
      if (!log.contains(std::regex_replace(f, std::regex(R"(^(.*)\.fsp$)"), "$1_p0.log")))
      {
        job = f;
        break;
      }
    if (job == "")
        break;
// {
//     std::ifstream hostname_ifs("hostname.txt");
//     hostname_ifs >> hostname;
// }

    auto current_path = std::filesystem::current_path();
    std::filesystem::current_path(std::filesystem::path(job).parent_path());
    auto filename = std::filesystem::path(job).filename().string();
// if windows
# ifdef _WIN32
    std::string command = "fdtd-solutions -run run.lsf -exit -trust-script";
# else
    std::string command = "DISPLAY=:0 /opt/lumerical/fdtd/bin/fdtd-solutions -nw -run run.lsf -exit";
# endif
    std::cout << fmt::format
    (
        "\r{}/{} running {} from {}\n command {}",
        log.size(),
        fsp.size(),
        filename,
        std::filesystem::current_path().string(),
        command
    )
    << std::flush;
// {
//     // https://api.chn.moe/notify_silent.php?message={hostname} {}/{}
//     httplib::SSLClient client("api.chn.moe");
//     std::string path = fmt::format("/notify_silent.php?message={}%20{}%2F{}", hostname, log.size(), fsp.size());
//     client.Get(path.c_str());
// }
    {
      std::ofstream os("run.lsf");
      os << fmt::format(R"(
load("{}");
switchtolayout;
run;
)", filename);
      os.close();
    }
    system(command.c_str());
    std::this_thread::sleep_for(5s);
    std::filesystem::remove("run.lsf");
    std::filesystem::current_path(current_path);
  } 
}

实际上队列系统与要运行的软件还是有许多耦合的，尤其是使用 MPI 并行的程序（包括绝大多数成熟的大型科学计算软件）。如何让队列系统与这些软件对接好有时会很麻烦，只是普通用户不需要关心（我替你们搞定了）。 ↩︎