programming your life!

Home ~/

View My GitHub Profile

  • vim note

    vim linux下打开中文乱码


    set fileencodings=utf-8,ucs-bom,gb18030,gbk,gb2312,cp936 set termencoding=utf-8 set encoding=utf-8

  • spark note

    value toDF is not a member of org.apache.spark.rdd.RDD

    Import implicits: Note that this should be done only after an instance of org.apache.spark.sql.SQLContext is created. It should be written as:

    val sqlContext= new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    value toDF is not a member of org.apache.spark.rdd.RDD

    How to convert rdd object to dataframe in spark (scala)

    SqlContext has a number of createDataFrame methods that create a DataFrame given an RDD. I imagine one of these will work for your context.

    For example:

    def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame
    // you should import implitcits after create a new SQLContext
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
        StructField("request1", MapType(StringType, StringType, true), true),
        StructField("response1", StringType, true)

    How to convert rdd object to dataframe in spark (scala) Converting Map type in Case Class to StructField Type

    How to get hadoop put to create directories if they don’t exist

    hadoop fs -mkdir -p <path>

    Spark union of multiple RDDs

    from functools import reduce  # For Python 3.x
    from pyspark.sql import DataFrame
    def unionAll(*dfs):
        return reduce(DataFrame.unionAll, dfs)
    df1 = sqlContext.createDataFrame([(1, "foo1"), (2, "bar1")], ("k", "v"))
    df2 = sqlContext.createDataFrame([(3, "foo2"), (4, "bar2")], ("k", "v"))
    df3 = sqlContext.createDataFrame([(5, "foo3"), (6, "bar3")], ("k", "v"))
    unionAll(df1, df2, df3).show()

    Spark union of multiple RDDs

    Create new column with function in Spark Dataframe

    import org.apache.spark.sql.functions._
    val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
    val coder: (Int => String) = (arg: Int) => {if (arg < 100) "little" else "big"}
    val sqlfunc = udf(coder)
    myDF.withColumn("Code", sqlfunc(col("Amt")))





    1. 一个文件路径,这时候只装载指定的文件
    2. 一个目录路径,这时候只装载指定目录下面的所有文件(不包括子目录下面的文件)
    3. 通过通配符的形式加载多个文件或者加载多个目录下面的所有文件





    How to load IPython shell with PySpark

    PYSPARK_DRIVER_PYTHON=ipython /path/to/bin/pyspark

    How to load IPython shell with PySpark

    Why does Spark report “ Relative path in absolute URI” when working with DataFrames?

    spark-shell --conf spark.sql.warehouse.dir=file:///c:/tmp/spark-warehouse
    PYSPARK_DRIVER_PYTHON=ipython pyspark --jars ../spark/jars/mysql-connector-java-5.1.40.jar --conf spark.sql.warehouse.dir=/data/warehouse.dir

    Why does Spark report “ Relative path in absolute URI” when working with DataFrames?

    datetime range filter in PySpark SQL

    dates = ("2013-01-01",  "2015-07-01")
    date_from, date_to = [to_date(lit(s)).cast(TimestampType()) for s in dates]
    sf.where((sf.my_col > date_from) & (sf.my_col < date_to))

    datetime range filter in PySpark SQL

    ‘0000-00-00 00:00:00’ can not be represented as java.sql.Timestamp error

    You can use this JDBC URL directly in your data source configuration:


    Apache Spark: map vs mapPartitions?

    What’s the difference between an RDD’s map and mapPartitions method? The method map converts each element of the source RDD into a single element of the result RDD by applying a function. mapPartitions converts each partition of the source RDD into multiple elements of the result (possibly none).

    And does flatMap behave like map or like mapPartitions? Neither, flatMap works on a single element (as map) and produces multiple elements of the result (as mapPartitions).

    Apache Spark: map vs mapPartitions?

    how to use SparkContext.textFile for local file system


    spark use local file

    pyspark connect mysql

    From pySpark, it work for me :

    dataframe_mysql ="jdbc").options(
        driver = "com.mysql.jdbc.Driver",
        dbtable = "my_tablename",

    How to work with MySQL DB and Apache Spark?

    How do I add a new column to a Spark DataFrame (using PySpark)?

    from pyspark.sql.functions import lit
    df = sqlContext.createDataFrame(
        [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))
    df_with_x4 = df.withColumn("x4", lit(0))
    ## +---+---+-----+---+
    ## | x1| x2|   x3| x4|
    ## +---+---+-----+---+
    ## |  1|  a| 23.0|  0|
    ## |  3|  B|-23.0|  0|
    ## +---+---+-----+---+

    How do I add a new column to a Spark DataFrame (using PySpark)?

  • scala note

    Scala partial functions

    val fraction = new PartialFunction[Int, Int] { def apply(d: Int) = 42 / d def isDefinedAt(d: Int) = d != 0 }

    List(41, “cat”) map { case i: Int ⇒ i + 1 }

    Scala partial functions

    sbt 加速

    ~/.sbt/repositories 文件,加入


    How to read Scala command line arguments

    println("Hello, " + args(0))
    >>> scala hello.scala Al
    >>> Hello, Al

    java properties file location

    The typical way of handling this is to load the base properties from your embedded file, and allow users of the application to specify an additional file with overrides. Some pseudocode:

    Properties p = new Properties();
    InputStream in = this.getClass().getResourceAsStream("");
    String externalFileName = System.getProperty("");
    InputStream fin = new FileInputStream(new File(externalFileName));

    Your program would be invoked similar to this:

    java -jar app.jar"/path/to/custom/"

    how to read properties file Java Properties file examples

    sbt deduplicate: different file contents found in the following

    % "provided"

    将相同的jar中排除一个,因为重复,可以使用”provided”关键字。 例如spark是一个容器类,编写spark应用程序我们需要spark core jar. 但是真正打包提交到集群上执行,则不需要将它打入jar包内。 这是我们使用 % “provided” 关键字来exclude它。

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "0.8.0-incubating" % "provided",
      "org.apache.hadoop" % "hadoop-client" % "2.0.0-cdh4.4.0" % "provided"

    different file contents found in the following

    Include Dependencies in JAR using SBT package

    For sbt 0.13.6+ add sbt-assembly as a dependency in project/assembly.sbt:

    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

    For example, here’s a multi-project build.sbt:

    lazy val commonSettings = Seq(
      version := "0.1-SNAPSHOT",
      organization := "com.example",
      scalaVersion := "2.10.1",
      test in assembly := {}
    lazy val app = (project in file("app")).
      settings(commonSettings: _*).
        mainClass in assembly := Some("com.example.Main"),
        // more settings here ...
    lazy val utils = (project in file("utils")).
      settings(commonSettings: _*).
        assemblyJarName in assembly := "utils.jar",
        // more settings here ...
      sbt clean assembly // 打包

    sbt-assembly how do I get sbt to gather all the jar files my code depends on into one place?

  • python note

    Display an image from a file in an IPython Notebook

    from IPython.display import Image Image(filename=’test.png’)

    In Python, how do I determine if an object is iterable?

    try: some_object_iterator = iter(some_object) except TypeError, te: print some_object, ‘is not iterable’

    try: _ = (e for e in my_object) except TypeError: print my_object, ‘is not iterable’

    import collections

    if isinstance(e, collections.Iterable): # e is iterable

    Best way to parse a URL query string

    from urlparse import urlparse, parse_qs URL=’’ parsed_url = urlparse(URL) parse_qs(parsed_url.query) {‘i’: [‘main’], ‘enc’: [’ Hello ‘], ‘mode’: [‘front’], ‘sid’: [‘12ab’]}

    modue six, urllib.parse

    string format

    content = u'''
            <td> 响应时间平均值 </td>
            <td> {mean:.4f} </td>
            <td> 响应时间75% </td>
            <td> {seventy_five:.4f} </td>
            <td> 响应时间95% </td>
            <td> {ninety_five:.4f} </td>
            <td> 可用性 </td>
            <td> {ava} </td>
        'mean': df.mean()[0],
        'seventy_five': df.quantile(.75)[0],
        'ninety_five': df.quantile(.95)[0],
        'ava': ava

    Using % and .format() for great good!

    How can I use Python to get the system hostname?

    import socket
    import platform

    Python: defaultdict of defaultdict?

    defaultdict(lambda : defaultdict(int))

    Python: defaultdict of defaultdict

    get random string

    Generating strings from (for example) lowercase characters:

    import random, string
    def randomword(length):
       return ''.join(random.choice(string.lowercase) for i in range(length))
    >>> randomword(10)
    >>> randomword(10)

    get random string

  • pandas note

    iter rows over dataframe

    >>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
    >>> row = next(df.iterrows())[1]
    >>> row
    int      1.0
    float    1.5
    Name: 0, dtype: float64
    >>> print(row['int'].dtype)
    >>> print(df['int'].dtype)