Speaker

李泓旻 (Andrew)

Material

Note

Problem to solve

  1. many small datasets -> Ray
  2. out-of-core datasets (larger than memory) -> Modin

Ray

  • core concepts
    1. tasks
      • stateless
      • return a future that resolves to the task's result
      • should be idempotent, so a failed task can be re-executed safely
    2. actors
      • stateful
      • actor handles can be passed to other actors or tasks
import ray

# initialize a Ray cluster (by default on your local machine)
ray.init()
  • components
    • global control store
      • maintains the system's control state
      • key-value store with pub-sub functionality
      • benefits
        • fault tolerance
        • low latency
    • global scheduler
    • local scheduler
    • in-memory object store
      • Plasma
      • stores the inputs and outputs of tasks (stateless computations)
      • each node has its own object store, shared between workers via shared memory
      • external storage is also supported (for example, for spilling objects that do not fit in memory)

ray

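# runtime_env is a dict that describes the environment the job needs (for example pip packages, environment variables, a working directory)
ray.init(runtime_env=runtime_env)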

Modin

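# drop-in replacement for `import pandas as pd`
import modin.pandas as pd
  • Why Modin?
    • high pandas API coverage (over 90%)
  • What if a pandas API is not supported?
    • Modin falls back to its "default to pandas" mode and runs the operation with pandas itself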

Category

PyCon APAC 2022
